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1. Fundamental concepts 


1.1 Introduction 


This course is concerned with numerical methods for the solution of mathematical problems on a computer, 
usually using floating-point arithmetic. Floating-point arithmetic, particularly the ‘IEEE Standard’, is 
covered in some detail. The mathematical problems solved by numerical methods include differentiation 
and integration, solution of equations of all types, finding a minimum value of a function, fitting curves or 
surfaces to data, etc. This course will look at a large range of such problems from different viewpoints, but 
rarely in great depth. 


‘Numerical analysis’ is a rigorous mathematical discipline in which such problems, and algorithms for their 
solution, are analysed in order to establish the condition of a problem or the stability of an algorithm and 
to gain insight into the design of better and more widely applicable algorithms. This course contains some 
elementary numerical analysis, and technical terms like condition and stability are discussed, although the 
mathematical content is kept to a minimum. The course also covers some aspects of the topic not normally 
found in numerical analysis texts, such as numerical software considerations. 


In summary, the purposes of this course are: 
(1) To explain floating-point arithmetic, and to describe current implementations of it. 


(2) To show that design of a numerical algorithm is not necessarily straightforward, even for some simple 
problems. 


(3) To illustrate, by examples, the basic principles of good numerical techniques. 


(4) To discuss numerical software from the points of view of a user and of a software designer. 


1.2 Floating-point arithmetic 
1.2.1 Overview 
Floating-point is a method for representing real numbers on a computer. 


Floating-point arithmetic is a very important subject and a rudimentary understanding of it is a pre-requisite 
for any modern numerical analysis course. It is also of importance in other areas of computer science: 
almost every programming language has floating-point data types, and these give rise to special floating- 
point exceptions such as overflow. So floating-point arithmetic is of interest to compiler writers and designers 
of operating systems, as well as to any computer user who has need of a numerical algorithm. Sections 1.2.1, 
1.2.2, 1.7, 1.13 and 3.3.2 are closely based on Goldberg’s paper, which may be consulted for further detail 
and examples. 


Until 1987 there was a lot in common between different computer manufacturers’ floating-point 
implementations, but no Standard. Since then the IEEE Standard has been adopted by some, but by no 
means all manufacturers and may yet grow in popularity. This Standard has taken advantage of the increased 
speed of modern processors to implement a floating-point model that is superior to earlier implementations 
because it is required to cater for many of the special cases that arise in calculations. The IEEE Standard 
gives an algorithm for the basic operations (+ — */,/) such that different implementations must produce the 
same results, i.e. in every bit. (In these notes the symbol * is used to denote floating-point multiplication 
whenever there might be any confusion with the symbol x, used in representing the value of a floating-point 
number.) The IEEE Standard is by no means perfect, but it is a major step forward: it is theoretically 
possible to prove the correctness of at least some floating-point algorithms when implemented on different 
machines under IEEE arithmetic. IEEE arithmetic is described in Section 1.13. 


We first discuss floating-point arithmetic in general, without reference to the IEEE Standard. 
1.2.2 General description of floating-point arithmetic 


Floating-point arithmetic is widely implemented in hardware, but software emulation can be done although 
very inefficiently. (In contrast, fixed-point arithmetic is implemented only in specialised hardware, because 
it is more efficient for some purposes.) 


In floating-point arithmetic fractional values can be represented, and numbers of greatly different magnitude, 
e.g. 109° and 10-°°. Other real number representations exist, but these have not found widespread 
acceptance to date. 


Each floating-point implementation has a base @ which, on present day machines, is typically 2 (binary), 
8 (octal), 10 (decimal), or 16 (hexadecimal). Base 10 is often found on hand-held calculators but rarely 
in fully programmable computers. We define the precision p as the number of digits (of base @) held in a 
floating-point number. If d; represents a digit then the general representation of a floating-point number is 


adpo.d,dod3 aes dy-1 x pb 


which has the value 


+(do + dB~* + doB-? +... 4+ dp_1 87 -)) Be 


where 0 < d; < f. 


If 6 = 2 and p = 24 then 0.1 cannot be represented exactly but is stored as the approximation 
1.10011001100110011001101 x 2-4. In this case the number 1.10011001100110011001101 is the significand 
(or mantissa), and —4 is the exponent. The storage of a floating-point number varies between machine 
ranges. For example, a particular computer may store this number using 24 bits for the significand, 1 bit 
for the sign (of the significand), and 7 bits for the exponent in order to store each floating-point number in 
4 bytes. Two different machines may use this format but store the significand and exponent in the opposite 
order; calculations might even produce the same answers but the internal bit-patterns in each word will be 
different. 


Note also that the exponent needs to take both positive and negative values, and there are various conventions 
for storing these. However, all implementations have a maximum and minimum value for the signed exponent, 
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usually denoted by emax and émin. 


IBM (System/370) single precision B=16 
1 7 bits 24 bits (= 6 hexadecimal digits) 
sign exponent signiticand 

IEEE single precision (e.g. Sun) B=2 
1 8 bits 23 bits 

sign exponent signiticand 


Both of the above examples have the most significant digit of the significand on the left. Some floating-point 
implementations have the most significant digit on the right, while others have sign, exponent and significand 
in a different order. 


Although not a very realistic implementation, it is convenient to use a model such as 8 = 10, p = 3 to 
illustrate examples that apply more generally. In this format, note that the number 0.1 can be represented 
as 1.00 x 107! or 0.10 x 10° or 0.01 x 10!. If the leading digit dy is non-zero then the number is said to be 
normalized; if the leading digit is zero then it is denormal (or subnormal). The normalized representation of 
any number, e.g. 1.00 x 107}, is unique. 


In a particular floating-point implementation it is possible to use normalized numbers only, except that zero 
cannot be represented at all. This can be overcome by reserving the smallest exponent for this purpose, call 
it €min — 1. Then floating-point zero is represented as 1.0 x Bemi=—?. 


1.2.3 The numerical analyst’s view of floating-point arithmetic 


Ideally, in order to use classical mathematical techniques, we would like to ignore the use of floating-point 
arithmetic. However, the use of such ‘approximate’ arithmetic is relevant, although the numerical analyst 
does not want to be concerned with details of a particular implementation of floating-point. In this course 
we take a simplified view of machine arithmetict. In general, arithmetic base and method of rounding are of 
no concern, except to note that some implementations are better than others. The floating-point precision is 
important. We will define the parameter machine epsilon, associated with the precision of each floating-point 
implementation, and then apply the same analysis for all implementations. 


Before making this powerful simplification, we consider some implications of the use of floating-point 
arithmetic which will be swept under the carpet in so doing. These points often cause problems for the 
software designer rather than the numerical analyst: : 


Floating-point is a finite arithmetic. This is not too serious, but is worth reflecting on. On most computers, 
for historical reasons, an integer and a floating-point number each occupy one word of storage, i.e. the 
same number of bits. Therefore there are the same number of representable numbers. For example, on an 
IBM mainframe a word is 32 bits so there are 25? representable integers (between —23! and +23) and 2° 
representable floating-point numbers (between —275? and +2757). 


+ Except in Sections 1.7, 1.13 and 3.3.2 where floating-point arithmetic is discussed in detail. 
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(2) 


++ 


The representable floating-point numbers, under the arithmetic operations available on the computer, do 
not constitute a fieldf. It is trivial to show this by multiplying two large numbers, the product of which is 
not representable (i.e. causes overflow). Less obvious deficiencies are more troublesome, e.g. in IBM single 
precision the number X = —1.0 x 10” is representable but —X is not. 


The representable floating-point numbers are more dense near zero. We shall see that this is reasonable, but 
leads to complications before we get very far. 


1.2.4 Overflow and underflow 


For convenience we will often use the terms overflow and underflow to denote numbers that are outside 
the range that is representable. It is worth pointing out that both overflow and underflow are hazards in 
numerical computation but in rather different ways. 


We can regard overflow as being caused by any calculation whose result is too large in absolute value to 
be represented, e.g. as a result of exponentiation or multiplication or division or, just possibly, addition or 
subtraction. This is potentially a serious problem: if we cannot perform arithmetic with oo then we must 
treat overflow as an error. In a language with no exception handling there is little that can be done other 
than terminate the program with an error condition. In some cases overflow can be anticipated and avoided. 
In Section 1.13 we will see how IEEE arithmetic deals with overflow, and makes recovery possible in many 
cases. 


Conversely, underflow is caused by any calculation whose result is too small to be distinguished from zero. 
Again this can be caused by several operations, although addition and subtraction are less likely to be 
responsible. However, we can perform ordinary arithmetic using zero (unlike oo) so underflow is less of a 
problem but more insidious: often, but not always, it is safe to treat an underflowing value as zero. There 
are several exceptions. For example, suppose a calculation involving time uses a variable time step é¢ which 
is used to update the time t (probably in a loop terminated at some fixed time) by assignments of the form 


deltat := deltat * some_positive_value 
t:= t+ deltat 


If the variable delta_t ever underflows, then the calculation may go into an infinite loop. An ‘ideal’ algorithm 
would anticipate and avoid such problems. 


It is also worth mentioning that overflow and underflow problems can often be overcome by re-scaling a 
problem to fit more comfortably into the representable number range. 
1.3 Errors and machine epsilon 


Suppose that z, y are real numbers not dangerously close to overflow or underflow. Let x* denote the 
floating-point representation of x. We define the absolute error € by 


r*7=rt+e 


and the relative error 6 by 
z*=2(1+6)=2+26 


We can write 


e=2d 

or, ifz #0, 
j= 
z 


A field is a mathematical structure in which elements of a set A obey the formal rules of ordinary arithmetic 
with respect to a pair of operators representing addition and multiplication. The concepts of subtraction 
and division are implicit in these rules. 


When discussing floating-point arithmetic, relative error seems appropriate because each number is 
represented to a similar relative accuracy. However, there is a problem when z = Q or z is very close 
to 0 so we will need to consider absolute error as well. 


When we write 2* = z(1+) it is clear that 6 depends on z. For any floating-point implementation, there 
must be a number wu such that || < u, for all z (excluding z values very close to overflow or underflow). The 
number uw is usually called the unit round off. On most computers 1* = 1. The smallest positive ¢ such that 


(1+e)* >1 


is called machine epsilon or macheps. It is often assumed that u ~ macheps. 


Let w represent any of the arithmetic operations + — +/ on real numbers. Let w* represent the equivalent 
floating-point operation. We naturally assume that 


rw" y ~ qwy. 
More specifically, we assume that 
qw*y = rwy(1 +4) (1.1) 


for some 6 (|6| < u). Equation (1.1) is very powerful and underlies backward error analysis which allows us to 
use ordinary arithmetic while considering the data to be ‘perturbed’. That is to say, approximate arithmetic 
applied to correct data can be thought of as correct arithmetic applied to approximate data. The latter idea 
is easier to deal with algebraically and leads to less pessimistic analyses. 


1.4 Error analysis 


It is possible to illustrate the ‘style’ of error analyses by means of a very simple example. Consider the 
evaluation of a function f(z) = z?. We would like to know how the error grows when a floating-point 
number is squared. 


Forward error analysis tells us about worst cases. In line with this pessimistic view we will assume that all 
floating-point numbers are inaccurately represented with approximately the same relative error, but again 
assume that numbers are not close to overflow or underflow. 


Forward error analysis 
We express the relative error in z by 
z* = 2z(1+6). 


Squaring both sides gives 
(2")? = 27(1 46)? 
= 27(14+26 + 67) 
~ 2?(1+ 26) 


since 6” is small, i.e. the relative error is approximately doubled. 


Backward error analysis 


To avoid confusion we now use p to denote the relative error in the result, i.e. we can write 


[F(z)]" = 2°(1 + p) (1.2) 
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such that |p| < u.¢ As p is small, 1+p > 0 so there must be another number f such that (1+)? = 1+) 
where [6| < |p] < u. We can now write 


[F(z)]* = 27(1 + A)? 
= f{z(1+ A)}- 


We can interpret this result by saying that the error in squaring a floating-point number is no worse 
that the error in representing the result in floating-point form. 


These results are dramatically different ways of looking at the same process. The forward error analysis tells 
us that squaring a number causes loss of accuracy. The backward error analysis suggests that this is not 
something we need to worry about, given that we have decided to use floating-point arithmetic in the first 
place. There is no contradiction between these results: they accurately describe the same computational 
process from different points of view. 


Historically, forward error analysis was developed first and led to some very pessimistic predictions about 
how numerical algorithms would perform for large problems. When backward error analysis was developed 
in the 1950s by J.H. Wilkinson, the most notable success was in the solution of simultaneous linear equations 
by Gaussian elimination (see Section 2.3). 


We will now investigate how errors can build up using the five basic floating-point operations: + — */ f, 
where ¢ denotes exponentiation. Again it would be convenient to use ordinary arithmetic but consider the 
following problem: suppose numbers z and y are exactly represented in floating-point but that the result 
of computing z * y is not exactly represented. We cannot explain where the error comes from without 
considering the properties of the particular implementation of floating-point arithmetic. 


As an example, consider double precision arithmetic on an IBM 3084. The value of macheps is approximately 
0.22 10-15. For the sake of simplicity we will assume that all numbers are represented with the same relative 
error 10735, 


We can deal most easily with multiplication. We write 


zi =21(1+6;) 

ro = 22(1+ 42) 
Then 

Zl X £23 = 2122(1 4+ 6,)(1 + 42) 

2122(1+ 6; + 62 + 6162) 
Ignoring 4,62 because it is small, the worst case is when 6, and 62 have the same sign, i.e. the relative error 
in 2} X £3 is no worse than |6,| + |d2|. Taking the IBM example, if we perform one million floating-point 
multiplications then at worst the relative error will have built up to 10°.10-15 = 10-9. 


We can easily dispose of division by using the binomial expansion to write 
1/z3 = (1/x2)(1 + 6o)7* = (1/z2)(1 — 62 + ...). 
Then, by a similar argument, the relative error in z}/z} is again no worse than |6,| + |d9|. 


We can compute zj t 7, for any integer n by repeated multiplication or division. Consequently we can argue 
that the relative error in zj] tn is no worse than n|é;|. 


This leaves addition and subtraction. Consider 


zit+25 = 21(1+ 61) + 22(1 + 62) 
= 21+ 22+ (216) + 2262) 
= 21+ 22+ (€1 + €2) 


+ Equation (1.2) comes directly from (1.1) by setting z = y, 6 = p and treating w as ‘multiply’. 
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where €1, €2 are the absolute errors in representing 21, f2 respectively. The worst case is when €, and é2 
have the same sign, i.e. the absolute error in zj + z5 is no worse than |e1| + |e2}. 


Using the fact that (—z2)" = —z2 — €2 we get that the absolute error in zj — z3 is also no worse that 
ler| + léel. 


The fact that error build-up in addition and subtraction depends on absolute accuracy, rather than 
relative accuracy, leads to a particular problem. Suppose we calculate 10 — 7 using a computer with 
macheps =~ 10~§, e.g. IBM single precision. 


V10 = 3.16228 
m = 3.14159 


V10 — x = 0.02089 


Because these numbers happen to be of similar size the absolute error in representing each is around 3 x 10-°. 
We expect that the absolute error in the result is about 6 x 10~° at worst. But as 


Zl — 2p = 2, — 22+ (€1 — €2) 
the relative error in 2} — z3 is 
€, — &2 
71-22 


which, in this case, turns out to be about 3 x 107-4. This means that the relative error in the subtraction is 
about 300 times as big as the relative error in x1 or 29. 


This problem is known as loss of significance. It can occur whenever two similar numbers of equal sign 
are subtracted (or two similar numbers of opposite sign are added), and is a major cause of inaccuracy in 
floating-point algorithms. For example, if a calculation is being performed using, say, macheps = 10-1° 
but one critical addition or subtraction causes a loss of significance with relative error, say, 10~!° then the 
relative error achieved may be only 1075. 


However, loss of significance can be quite a subtle problem as the following two examples illustrate. Consider 


the function sin z which has the series expansion 


snz=z2-—+—>- 


which converges, for any z, to a value in the range —1 <sinz < 1. 


If we attempt to sum this series of terms with alternating signs using floating-point arithmetic we can 
anticipate that loss of significance will be a problem. Using a BBC micro (macheps ~ 2 x 1071°) the 
following results may be obtained. 


Example 1 
Taking z = 2.449 the series begins 
sin z = 2.449 — 2.448020808 + 0.7341126024 — ... 


and ‘converges’ after 11 terms to 0.6385346144, with a relative error barely larger than macheps. The 
obvious loss of significance after the first subtraction does not appear to matter. 


Example 2 
Taking z = 20 the series begins 


sin z = 20 — 1333.333333 + 26666.66667 — .... 
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The 10th term, —43099804.09 is the largest and the 27th term is the first whose absolute value is less 
than 1. The series ‘converges’ after 37 terms to 0.9091075368 whereas sin 20 = 0.9129452507, giving 
a relative error of about 4 x 1073 despite the fact that no individual operation causes such a great 
loss of significance. 


It is important to remember, in explaining these effects, that a sequence of addition/subtraction operations 
is being performed. 


In Example | the evaluation of 2.449 — 2.448020808 causes loss of significance, i.e. the relative error increases 
by at least 1000 times although the absolute error is at worst doubled. Adding in the next term, 0.7341126024, 
again makes little difference to the absolute error but, because the new term is much larger, the relative 
error is reduced again. Overall, the relative error is small. 


In Example 2 the loss of significance is caused by the fact that the final result is small compared with 
numbers involved in the middle of the calculation. The absolute error in the result is the sum of the absolute 
error in each addition/subtraction. Calculations involving the largest term will contribute an absolute error 
of nearly 10-2 so we could estimate the final relative error to be approximately 10-?/0.91, which is slightly 
larger than the observed error. 


Exercise la 


In any language with floating-point arithmetic, write a program to evaluate 
n 
i 
j=l 


(a) by summing over increasing values of j, and (b) by summing over decreasing values of j. Set 
n = 100000 and 


a 1, 
z; = 1000e/j, 7 =2,3,...n 


where € ~ macheps. (You will need to discover, from documentation or otherwise, an approximate 
value of macheps for the precision that you are using.) Explain why the results of (a) and (b) are 
different. 


1.5 Solving quadratics 


Solving the quadratic equation az? + bz +c = 0 appears, on the face of it, to be a very elementary problem. 
It is also a very important one, principally because it is often encountered as part of a larger calculation. 
Particularly important is the use of quadratic equations in matrix computations in which, typically, several 
thousand quadratics might need to be solved in a straightforward calculation. 


Such applications require an algorithm for solution of a quadratic equation that is robust in the sense that it 
will not fail or give inaccurate answers for any reasonable representable coefficients a, 6 and c. We will now 
investigate how easy this is to achieve. 


It is well known that the solution of az? + bz +c = 0 can be expressed by the formula 


es —b + Vb? — 4ac 
= 5a : 
A problem arises if 6? >> |4ac| in which case one root is small. Suppose b > 0 so that the small root is given 
by 
—b+ Vb? — 4ac 
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with loss of significance in the numerator. The problem can be averted by writing 


wes —b+ Vb? — 4ac —b — Vb? — 4ac 


2a —b — Vb? — 4ac 


and simplifying to get 
—2 
eon See as (1.4) 


b+ Vb? — 4ac 


so that the similar quantities to be summed are now of the same sign. Taking a = 1, b = 100000, c= 1 as 
an example and using a BBC micro again (macheps ~ 2 x 10~1°) we get 


z = —1.525878906 x 107° using (1.3) 
zx = —1.000000000 x 10-5 using (1.4) 
and the latter is as accurate as the machine precision will allow. 
A robust algorithm must use equation (1.3) or (1.4) as appropriate in each case. 


The solution of quadratic equations illustrates that even the simplest problems can present numerical 
difficulties although, if anticipated, the difficulties may be circumvented by adequate analysis. 


Exercise 1b 


The Bessel functions Jo(z), Ji(x), Jo(z), ... satisfy the recurrence formula 
In+1(2) = (2n/z)Jn(z) — Jn-1(2) 
On a certain computer, when z = 2/11, the first two Bessel functions have the approximate values 


Jo(x) = 0.991752 
Ji(z) = 0.0905339 


where there is known to be an error of about 0.5 in the last digit. Assuming 2/z evaluates to 11 exactly, 
and using six significant decimal digits, calculate J2(z) and J3(xr) from the formula and estimate the 
approximate relative error in each. How accurately can J4(z) be calculated? (N.B. This exercise is 
designed so that the arithmetic is very simple, and a computer or calculator is unnecessary.) 


When z = 20, using 
Jo(z) = 0.167024 


Ji(z) = 0.0668331 


Ja(z) can be evaluated to the same relative accuracy as Jo(z) and J,(z). How do you account for 
this? 


1.6 Convergence and error testing 


An iterative numerical process is one in which a calculation is repeated to produce a sequence of approximate 
solutions. If the process is successful, the approximate solutions will converge. 


Convergence of a sequence is usually defined as follows. Let zo, 21, 22, ... be a sequence (of approximations) 
and let z be a number. We define 
En =In—Z. 


The sequence converges if 
lim €, = 0 
n->0o 
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for some number z, which is called the limit of the sequence. The important point about this definition is 
that convergence of a sequence is defined in terms of absolute error. In a numerical algorithm we cannot 
really test for convergence as it is an infinite process, but we can check that the error is getting smaller, and 
consequently that convergence is likely. Writers of numerical software tend to talk loosely of ‘convergence 
testing’, by which they usually mean error testing. Some subtlety is required in error testing, as we will now 
see. 


For simplicity, we now drop the suffix n. 


Suppose that a numerical algorithm is being designed for a problem with a solution x. Let £ be an 
approximation to z, and let & be an estimate of the absolute error in 2, i.e. 


E~r~zt+é. 
Note that, for a typical problem, it is unlikely that we can find the eract absolute error, otherwise we would 


be able to calculate the ezact solution from it. 


Assume that a target absolute accuracy & is specified. Then we could use a simple error test in which the 
calculation is terminated when 


lé| < ee. (1.5) 


Consider floating-point arithmetic with macheps ~ 10-1°. If z is large, say 107°, and e, = 107° then € is 
never likely to be much less than 10*, so condition (1.5) is unlikely to be satisfied even when the process 
converges. 


Despite the definition of convergence it appears that relative error would be better, although it is unsafe to 
estimate relative error in case Z = 0. Writing 6; for target relative error, we could replace (1.5) with the 
error test 


lé| < 4:12]. (1.6) 


In this case, if |z| is very small then 4;|Z| may underflow{ and then test (1.6) may never be satisfied (unless 
é is exactly zero). 


As (1.5) is useful when (1.6) is not, and vice versa, so-called mized error tests have been developed. In 
the simplest form of such a test, a target error 7 is prescribed and the calculation is terminated when the 
condition 


lé] < me(1 + [2]) (1.7) 


is satisfied. If |Z] is small 7, may be thought of as target absolute error, or if |Z| is large m, may be thought 
of as target relative error. 


A test like (1.7) is used in much modern numerical software, but it solves only part of the problem. We also 
need to consider how to estimate ¢. The simplest formula is 


En = In —Tn-1 (1.8) 
where we again use the subscript n to denote the nth approximation. Suppose an iterative method is used 


Note that the treatment of underflowed values may depend on both the floating-point architecture and the 
programming language in use. 
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to find the positive root of the equation z+ = 16 with successive approximations 


Zo =4 
z,=0.5 

zo = 2.25 

z3 = 2.25 

24 = 1.995 
z3 = 2.00001 


Ze = 1.999999998 


If (1.8) is used then the test (1.7) will cause premature termination of the algorithm with the incorrect answer 
2.25. In case this contrived example is thought to be unlikely to occur in practice, theoretical research has 
shown that such cases must always arise for a wide class of numerical methods. -A safer test is given by 


&,= lZa = Zn-1| + [Zn-1 — Zn-2| (1.9) 


but again research has shown that zp-2, Zn-1 and z, can all coincide for certain methods so (1.9) is not 
guaranteed to work. In many problems, however, confirmation of convergence can be obtained independently: 
e.g. in the example above it can be verified by computation that 2.25 does not satisfy the equation z+ = 16. 


Another important problem associated with ‘convergence detection’ is that the ‘granularity’ of a floating- 
point representation leads to some uncertainty} as to the exact location of, say, a zero of a function f(z). 
Suppose that the mathematical function f(z) is smooth and crosses the z axis at some point a. Suppose also 
that f(z) is evaluated on a particular computer for every representable number in a small interval around 
z =a. The graph of these values, when drawn at a suitably magnified scale might appear as follows. 


If we state that f(a) = 0 then there is uncertainty as to the value of a. Alternatively, if we fix a point 
a, then there is uncertainty in the value of f(a) unless a has an exact representation in floating-point. 
Note particularly, that a 1-bit change in the representation of z might produce a change of sign in the 
representation of f(z). 


This problem is easily avoided by specifying a target error that is not too close to macheps. 


1.7 Rounding error in floating-point arithmetic 


The term infinite precision is used below to denote the result of a basic floating point operation (+ — */) 
as if performed exactly and then rounded to the available precision. Note that a calculation can actually be 
performed in this way for the operations + — and +, but not for / as the result of a division may require 


7 Physicists may note that there is an analogy here with quantum physics and uncertainty principles. 
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infinitely many digits to store. However, the end result of infinite precision division can be computed by a 
finite algorithm. 


Suppose @ = 10 and p = 3. If the result of a floating-point calculation is 3.12 x 107? when the result to 
infinite precision is 3.14 x 10-7, then there is said to be an error of 2 units in the last place. The acronym 
ulp (plural ulps) is often used for this. When comparing a real number with a machine representation, it 
is usual to refer to fractions of an ulp. For example, if the number 7 (whose value is the non-terminating 
non-repeating decimal 3.1415926535...) is represented by the floating-point number 3.14 x 10° then the 
error may be described as 0.15926535... ulps. The maximum error in representing any accurately known 
number (except near to underflow or overflow) is 1/2 ulp. 


The numerical analyst does not want to be bothered with concepts like ‘ulp’ because the error in a calculation, 
when measured in ulps, varies in a rather haphazard way and is different from machine range to machine 
range. The numerical analyst is more interested in relative error. An interesting question is: how does the 
relative error in an operation vary between different floating-point implementations? This problem is ‘swept 
under the carpet’ in most numerical analysis. Subtraction of similar quantities is a notorious source of error. 
If a floating-point format has parameters @ and p, and subtraction is performed using p digits, then the 
worst relative error of the result is G—1, i.e. base 16 can be 15 times less accurate than base 2. This occurs, 
for example, in the calculation 
1.000...0 x B°-—O.ppp...p x p* 


where p = §—1 if the least significant p is lost during the subtraction. This property implies that the smaller 
the base the better, and that @ = 2 is therefore optimal in this respect. Furthermore base 16, favoured by 
IBM, is a rather disadvantageous choice. 


To see how this works in practice, compare the two representations 8 = 16, p= 1 and 8 = 2, p= 4. Both 
of these require 4 bits of significand. The worst case for hexadecimal is illustrated by the representation of 
15/8. In binary, 15 is represented by 1.111 x 23 so that 15/8 is 1.111 x 2°. In hexadecimal the digits are 
normally written as 

0123456789ABCDEF 


so the number 15 is represented simply by F x 16° whereas 15/8 can be represented no more accurately 
than 2 x 16°. Although this example is based on 4-bit precision, it is generally true that 3 bits of accuracy 
can be lost in representing a number} in base 16. 


A relatively cheap way of improving the accuracy of floating-point arithmetic is to make use of extra digits, 
called guard digits, in the computer’s arithmetic unit only. If z and y are numbers represented exactly on a 
computer and x — y is evaluated using just one guard digit, then the relative error in the result is less than 
2 x macheps. Although this only applies if x and y are represented exactly, this property can be used to 
improve expression evaluation algorithms. 


In this Section the rather loosely defined term mathematically equivalent is used to refer to different 
expressions or formulae that are equivalent in true real arithmetic. 


For example, consider the evaluation of z? — y? when z ~ y, given exactly known z and y. If x? or y’ 
is inexact when evaluated then loss of significance will usually occur when the subtraction is performed. 
However, if the (mathematically equivalent) calculation (z + y)(x — y) is performed instead then the relative 
error in evaluating z — y is less than 2 x macheps provided there is at least one guard digit, because z and 
y are exactly known. Therefore serious loss of significance will not occur in the latter case. 


Careful consideration of the properties of floating-point arithmetic can lead to algorithms that make little 
sense in terms of real arithmetic. Consider the evaluation of In(1+z) for z small and positive. Suppose there 
exists a function LN(y) that returns an answer accurate to less than 1/2 ulp if y is accurately represented. If 
z < macheps then 1+ z2 evaluates to 1 and LN(1) will presumably return 0. A more accurate approximation 


There is a compensating advantage in using base 16: the exponent range is greater, but this is not really as 
useful as increased accuracy. 
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in this case is In(1 + +) ~ z. However there is also a problem if z > macheps but z is still small, because 
several digits of z will be lost when it is added to 1. In this case, In(1+z) ~ LN(1+72z) is less accurate than 
the (mathematically equivalent) formula 


In(t+2)~2*2IN(14+2)/((1+2)—-1). 


Indeed it can be proved that the relative error is at most 5 x macheps for z < 0.75, provided at least one 
guard digit is usedt. 


The cost of a guard digit is not high in terms of processor speed: typically 2% per guard digit for a double 
precision adder. However the use of guard digits is not universal, e.g. Cray supercomputers do not have 
them. 


The idea of rounding is fairly straightforward. Consider base 10, as it is familiar, and assume p = 3. The 
number 1.234 should be rounded to 1.23 and 1.238 should be rounded to 1.24. Most computers perform 
rounding, although it is only a user-specifiable option on some machine ranges. Curiously, IBM mainframes 
have no rounding and merely chop off extra digits. 


The only point of contention is what to do about rounding in half-way cases: should 1.235 be rounded to 
1.23 or to 1.24? A common method is to specify that the digit 5 rounds up, on the grounds that half of 
the digits round down and the other half round up. This is used, for example, on DEC VAX computers. 
Unfortunately this introduces a slight bias. 


An alternative is to use ‘round to even’: the number 1.235 rounds to 1.24 because the digit 4 is even, whereas 
1.265 rounds to 1.26 for the same reason. The slight bias of the above method is removed. Reiser & Knuth 
(1975) highlighted this by considering the sequence 


Zn = (Zn-1 —y)+y 


where n = 0,1,2,... and the calculations are performed in floating point arithmetic. Using ‘round to even’ 
they showed that z, = z; for all n, for any starting values zp and y. This is not true of the ‘5 rounds up’ 
method. For example, let @ = 10, p = 3 and choose zp = 1.00 and y = —5.55 x 107}. This leads to the 
sequence z) = 1.01, z2 = 1.02, z3 = 1.03, ... with 1 ulp being added at each stage. Using ‘round to even’, 
Zn = 1.00 for all n. 


It is also worth mentioning that probabilistic rounding has been suggested, i.e. rounding half-way cases up 
or down at random. This is also unbiased and has the advantage of reducing the overall standard deviation 
of errors, so could be considered more accurate, but it has the disadvantage that repeatability of calculations 
is lost. It is not implemented on any modern hardware. 


Finally, it is often argued that many of the disadvantages of floating-point arithmetic can be overcome by 
using arbitrary (or variable) precision rather than a fixed precision. While there is some truth in this claim, 
efficiency is important and arbitrary precision algorithms are very often prohibitively expensive. This subject 
is not discussed here. 


Exercise 1c 


Explain the term infinite precision. Consider a floating-point implementation with 6 = 10, p = 2. 
What are the infinite precision results of the calculations 


(a) (1.2 x 10*) « (1.2 x 10?) (b) (2.0 x 10°)/(3.0 x 10°) 


How many decimal guard digits are required in the significand of the arithmetic unit to calculate (b) 
to infinite precision? 


+ See Goldberg’s paper for an explanation of how this works. 
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1.8 Norms 


Up to now we have discussed only scalar problems, i.e. problems with a single number ¢z as solution. The 
more important practical numerical problems often have a vector x = {x1, 22, ...tn} of solutions}. Although 
this is a major complication, it is surprising how often a numerical method for an n-dimensional problem 
can be derived from a scalar method simply by making the appropriate generalisation in the algebra. When 
programming a numerical algorithm for a vector problem it is still desirable to use an error test like (1.7) in 
the last section, but it is necessary to generalise this by introducing a measurement of the size of a vector. 


In order to compare the size of vectors we need a scalar number, called the norm, which has properties 
appropriate to the concept of ‘size’: 


(1) The norm should be positive, unless the vector is {0,0, ...0} in which case it should be 0. 
(2) If the elements of the vector are made systematically smaller or larger then so should the norm. 


(3) If two vectors are added or subtracted then the norm of the result should be suitably related to the 
norms of the original vectors. 


The norm of a vector x is usually written ||x|| and is any quantity satisfying the three axioms: 
(1) ||x|| > 0 for any x. ||x||=0 <> x=0= {0,0,...0}. 
(2) [lcx|| = |e|.||x|] where c is any scalar. 
(3) [|x + yl] < [|x|] + lly|| where x and y are any two vectors. 


The norms most commonly used are the so-called /, norms in which, for any real number p > 1, 


ls 2 

P 

lixilp = {5- lea}. 
i=l 


In practice, only the cases p = 1 or 2 or limp co ||x||p are used. The relevant formulae are: 


|x|]. = SS lzil lynorm 


f=1 


\|x]l2 = lgnorm 


[|<|loo = ss |z;| log norm 


Alternative names for the /2 norm are: Euclidean norm, Euclidean length, length, mean. 
Alternative names for the /,, norm are: max norm, uniform norm, Chebyshev norm. 


Consider the following example from an iterative process with a vector of solutions: 


x1 = {1.04, 2.13, 2.92, -1.10} 
x2 = {1.05, 2.03, 3.04, —1.04} 
x3 = {1.01, 2.00, 2.99, —1.00}. 


This appears to be converging to something like {1,2,3,—1}, but we need to be able to test for this in a 
program. If we use, for example, a vector generalisation of (1.8), i.e. 6p = Xn — Xn-1 then we get 


&2 = {0.01, —0.10, 0.12, 0.06} 
&3 = {—0.04, —0.03, 0.05, 0.04}. 


t Note that rz, does not have the same meaning as in the last section. 
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The norms of these error vectors are as follows 


ly ee 
€2 0.29 0.17 0.12 
é3 0.16 0.08 0.05 


Note that in each case ||€3|| < |[€2|| so it does not matter, for ‘convergence testing’ purposes, which norm is 
used. 


A simple vector generalisation of the error test (1.7) is 


Iléll < me(2 + [ll]) 


where 7 is the prescribed (scalar) error and any suitable norm is used. (Of course, the same norm must be 
used for both {]é|| and ||x|| so that they can be compared.) 


1.9 Condition of a problem 


The condition of a numerical problem is a qualitative or quantitative statement about how easy it is to solve. 
irrespective of the algorithm used to solve it. 


As a qualitative example, consider the solution of two simultaneous linear equations. The problem may be 
described graphically by the pair of straight lines representing each equation: the solution is then the point 
of intersection of the lines. (The method of solving a pair of equations by actually drawing the lines with 
ruler and pencil on graph paper and measuring the coordinates of the solution point may be regarded as an 
algorithm that can be used.) Two typical cases are illustrated below: 


well conditioned ill conditioned 


The left-hand problem is easier to solve than the right hand one, irrespective of the graphical algorithm used. 
For example, a better (or worse) algorithm is to use a sharper (or blunter) pencil: but in any case it should 
be possible to measure the coordinates of the solution more exactly in the left-hand case than the right. 


Quantitatively, the condition K of a problem is a measure of the sensitivity of the problem to a small 
perturbation. For example (and without much loss of generality) we can consider the problem of evaluating 
a differentiable function f(z). Let z* be a point close to z. In this case K is a function of z defined as the 
relative change in f(z) caused by a unit relative change in z. That is 


K(2) = tim Wl) = #e"I/te)I 


ssz  |(z—2")/z| 
= [y [f(z) = F(z")I 
f(z) Roe jz—="| 
= |2f0) 
f(z) 
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from the definition of f’(x). 


Example 3 
Suppose f(z) = /z. We get 
z.f'(z) E z.[3/V/z] sak 
f(z) vz 2° 


So K is a constant which implies that taking square roots is equally well conditioned whatever the 
value of z, and that the relative error is reduced by half in the process. 


K(z) = 


Example 4 


Suppose now that f(z) = ;4-. In this case we get 


ee 2 FG) |. | al/0- 2) |= | 
ie) = f(z) |= 1/(1—2z) | l—zl’ 

This K(z) can get arbitrarily large for values of z close to 1 and can be used to estimate the relative 

error in f(z) for such values, e.g. if z = 1.000001 then the relative error will increase by a factor of 

about 10°. 


Exercise 1d 


Examine the condition of the problem of evaluating cos z. 


1.10 Stability of an algorithm 


If we represent the evaluation of a function by the rather informal graph 


where z = Zo and zy, = f(z). An algorithm is a particular method for solving a given problem, in this case 
the evaluation of a function. 


Alternatively, an algorithm can be thought of as a sequence of problems, i.e. a sequence of function 
evaluations. In this case we consider the algorithm for evaluating f(z) to consist of the evaluation of 
the sequence 21, 22, ...2n. We are concerned with the condition of each of the functions f1(r1), f2(z2), 
.- fn-1(2n—1) where f(z) = f;(xz;) for all z. © 


An algorithm is unstable if any f; is ill-conditioned, i.e. if any f;(z;) has condition much worse than f(z). 


Consider the example 
f(z) =Vz+1—-Vz 
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so that there is potential loss of significance when z is large. Taking z = 12345 as an example, one possible 
algorithm is 


Zo:= 2 = 12345 
Z:=2%+1 
re= V71 (1.10) 


@3:= VTZo 


f(z): = 24:= 22 — 23 


The loss of significance occurs with the final subtraction. We can rewrite the last step in the form 


fs(z3) = 22 — zg to show how the final answer depends on 23. As f3(23) = —1 we have the condition 
) = | 23fales) t3f3(z3) |= | | 
f(a) I2— 23 


from which we find K(r3) ~ 2.2 x 10* when z = 12345. Note that this is the condition of a subproblem 
arrived at during the algorithm. 


To find an alternative algorithm we write 


) = (Vari - yay Ete 


Ve+1+ Vz 
1 
~ Ve+l+ Ve 
This suggests the algorithm 
to:= 2 = 12345 
r:=29+1 


Te:=— V21 

@3:= VTo 

@4:=22+73 
f(z):= 25:= 1/24 


In this case f3(z3) = 1/(z2 + 23) giving a condition for the subproblem of 


(1.11) 


23) = | Bae | | 


which is approximately 0.5 when z = 12345, and indeed in any case where z is much larger than 1. 


aaa 


Thus algorithm (1.10) is unstable and (1.11) is stable for large values of z. In general such analyses are 
not usually so straightforward but, in principle, stability can be analysed by examining the condition of a 
sequence of subproblems. 


Exercise le 


Suppose that a function In is available to compute the natural logarithm of its argument. Consider 
the calculation of In(1+ 2), for small z, by the following algorithm 


To s= 2 
Zi :=I+1 
f(x) := 22 := In(z1) 


By considering the condition K(z1) of the subproblem of evaluating In(z,), show that such a function 
In is inadequate for calculating In(1+ z) accurately. 


17 


1.11 Order of convergence 


Let zo, 21, 22, ... be a sequence of successive approximations in the solution of a numerical problem. We 
define the absolute error in the nth approximation by 


In =~ I+En. 
If there exist constants p and C such that 


5 €. 1 
lim | sa 
n>oo| eF 


=C. 


then the process has order of convergence p, where p > 1. This is often expressed as 
lentil = O(len|?) 
using the O-notation. We can say that order p convergence implies that 
lentil = Clen|? 
for sufficiently large n. Obviously, for p = 1 it is required that C < 1, but this is not necessary for p > 1. 


Various rates of convergence are encountered in numerical algorithms, the most important of which are as 
follows. 


p = 1: linear convergence. Each interation produces the same reduction in absolute error. This is 
generally regarded as being too slow for practical methods. 


p = 2: quadratic convergence. Each iteration squares the absolute error, which is very satisfactory 
provided the error is small. This is sometimes regarded as the ‘ideal’ rate of convergence. 


1 < p < 2: superlinear convergence. This is not as good as quadratic but, as the reduction in 
error increases with each interation, it may be regarded as the minimum acceptable rate for a useful 
algorithm. 


p > 2. Such methods are unusual, mainly because they are restricted to very smooth functions and 
often require considerably more computation than quadratic methods. 


Exponential rate of convergence. This is an interesting special case. 


It is worth mentioning a technique known as convergence acceleration. This is often used to improve linear 
convergence to a quadratic rate or better. It is useful when the convergence acceleration method itself 
requires less computation than using a higher order method. Such techniques are Aitken’s 6? process and 
Wynn’s € algorithm which are beyond the scope of this course. 


1.12 Computational complexity 


The ideal algorithm should not only be stable, and have a fast rate of convergence, but should also have a 
reasonable computational complerity. It is possible to devise a stable algorithm that has, say, a quadratic 
rate of convergence but is still too slow to be practical for large problems. Suppose that a problem involves 
computations on a matrix of size N x N. If the computation time increases rapidly as N increases then the 
algorithm will be unsuitable for calculations with large matrices. 


Suppose that some operation, call it ©, is the most expensive in a particular algorithm. If the time spent 
by the algorithm may be expressed as O[f(N)] operations of type ©, then we say that the computational 
complezity is f(N). 


In matrix calculations, the most expensive operations are multiplication and array references. For this 
purpose the operation © may be taken to be the combination of a multiplication and one or more array 
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references. Useful matrix algorithms typically have computational complexity N?, N° or N+. These 
effectively put limits on the sizes of matrices that can be dealt with in a reasonable time. As an example, 
consider multiplication of matrices A = (a;;) and B = (b;;) to form a product C = (c;;). This requires N? 
sums of the form 


N 
cig = y ik -b45 
k=1 


each of which requires N multiplications (plus array references). Therefore the computational complexity is 
N?.N=N3, 


Note that operations of lower complexity do not change the overall computational complexity of an algorithm. 
For example, if an N? process is performed each time an N° process is performed then, because 


O(N?) + O(N?) = O(N) 


the overall computational complexity is still N3.t 


1.13 The IEEE Floating-point Standards 


There are in fact two IEEE Standards, but in this section it is only occasionally necessary to distinguish 
between them as they are based on the same principles. 


The Standard known as IEEE 754 requires that § = 2 and specifies the precise layout of bits in both single 
precision (p = 24) and double precision (p = 53). 


The other Standard, IEEE 854, is more general and allows either @ = 2 or 8 = 10. The values of p for 
single and double precision are not exactly specified, but constraints are placed on allowable values for them. 
Base 10 is included mainly for calculators, for which there is some advantage in using the same number base 
for the calculations as for the display. However, bases 8 and 16 are not supported by the Standard. 


IEEE arithmetic is implemented by several manufacturers, e.g. Sun, but not yet by all, e.g. IBM System/370 
mainframes. 


If binary is used, one extra bit of significance can be gained because the first bit of the significand of a 
normalized number must always be 1, so need not be stored. Such implementations are said to have a hidden 
bit. Note that this device does not work for any other base. 


IBEE 754 single precision uses 1 bit for the sign (of the significand), 8 bits for the exponent, and 23 bits 
for the significand. However, p = 24 because a hidden bit is used. Zero is represented by an exponent 
of €min — 1 and a significand of all zeros. There are other special cases also, as discussed below. Single 
precision numbers are designed to fit into 32 bits and double precision numbers, which have both a higher 
precisionj and an extended exponent range, are designed to fit into 64 bits. IEEE 754 also defines two types 
of ‘extended precision’, but these are only specified in terms of minimum requirements: these are intended 
for cases where a little extra precision is essential in a calculation. The table below summarises the four 
precisions. 


If, however, the N? process was performed N? times each time the N? process was performed then the 
computational complexity would be N?.N? = N4. 
Note that double precision has more than double the precision of single precision. 
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Single Single Extended Double Double Extended 


p 24 > 32 53 > 64 

€max +127 > +1023 +1023 > +16383 

€min —126 < —1022 —1022 < —16382 
exponent width (bits) 8 >11 11 > 15 
total width (bits) 32 > 43 64 > 79 


Note that an implementation which has hardware double precision would probably use this whenever ‘single 
extended’ was appropriate, to avoid implementing all four precisions. Also note that ‘double extended’ is 
sometimes referred to as ‘80-bit format’, even though it only requires 79 bits. This is for the extremely 
practical reason that ‘double extended’ is often implemented by software emulation, rather than hardware, 
so it is safer to assume that the ‘hidden bit’ might not be hidden after all! 


The IEEE Standard uses 1 bit for the sign of the significand and the remaining bits for its magnitudet. 


Exponents are stored using a biased representation. If e is the exponent of a floating-point number, then the 
bit-pattern of the stored exponent has the value e + e€max when it is interpreted as an unsigned integer. 


Under IEEE, the operations + — * / must all be performed accurately, i.e. as in infinite precision, rounded 
to the nearest correct result, using ‘round to even’. This can be implemented efficiently, but requires a 
minimum of two guard digits plus one extra bit, i.e. three guard digits in the binary case. 


The IEEE Standard specifies that the square root and remainder operations must be correctly rounded, and 
also conversions between integer and floating-point types. The conversion between internal floating-point 
format and decimal for display purposes must be done accurately, apart from numbers close to overflow. 


The IEEE Standard does not specify that transcendental functions be exactly rounded for two reasons: 
(1) correct implementation of ‘round to even’ can in principle require an arbitrarily large amount of 
computation, and (2) no algorithmic definitions were found to be suitable across all machine rangest. 


It has been pointed out that the IEEE Standard has failed to specify how inner products of the form 


n 
> 2i-yi 
t=1 


are calculated, as incorrect answers can result if extra precision is not used. It is feasible to standardise the 
calculation of inner products, and it is to be hoped that this can be added to the Standard at some time in 
the future. 


Some older floating-point representations, e.g. that used on IBM System/370 mainframes, treat every bit 
pattern as a valid floating-point number. In some programming languages, e.g. Standard Fortran 77, there 
is no exception handling, so there is little prospect of recovering from certain kinds of numerical error. For 
example, using strictly Standard Fortran 77 on the above computer, and real arithmetic, calling the square 
root function with the argument —4 is catastrophic. An error message is printed and the program stops; 
an option does allow the program to continue after such an error, and the square root function then has to 
return some valid floating-point result — but none of them is correct! 


The ‘sign/magnitude’ method was preferred to the alternative ‘2s-complement’ method in which the 
significand is represented by the smallest non-negative number that is congruent to it modulo 2?. 

The modern trend is towards large tables from which values of the common transcendental functions can be 
interpolated, but the storage overhead makes this method unsuitable for small computers. 
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The IEEE Standard attempts to tackle this problem without assuming that exception handling is available. 
Special values (all with exponents emax +1 or emin — 1) are reserved for the special quantities +0, +00, 
denormalized numbers, and the concept of NaN (Not a Number). There are two (signed) values of zero, 
although IEEE is at pains to specify that they are indistinguishable in normal arithmetic. Operations that 
overflow yield +00, and operations that underflow yield +0. Note that other calculations can yield +00 (e.g. 
z/0) or +0 (e.g. z/o0). 

There are many NaWNs, though the significance of the different values is not standardised. To all intents 
and purposes, the user can regard all NaWN-s as identical, although system-dependent information may be 


contained in a particular NaN value. The table below shows how the various categories of number are 
stored. 


Exponent Fraction Represents 
€= Emin — 1 f=0 +0 zero 


€=€min — 1 f#0 +0.f x 2*=* denormalized numbers 


€min <€ <€max any f +1.f x 2° normalized numbers 
€ = €max + 1 f=0 +00 infinities 
€ = Cmax t+ 1 fF#0 NaN Not a Number 


Note that as a special exponent is used to denote denormalized numbers, the ‘hidden bit’ device can be used 
for these as well, except that the hidden bit is always 0 in this case. The following is a list of operations that 
produce a NaN: 


any operation involving a NaN 00/00 
oo + (—0o) z REM 0 
0 * 00 oo REM z 
0/0 Jz for ~<0 


A manufacturer may choose to use the significand of a NaN to store system specific information. If so then 
the result of any operation between a NaN and anon-NaN must be the value of the NaJN;; the result of an 
operation between two NaNs must be the value of one of them. 


An arithmetic operation involving +00 typically produces a NaN or +00, where the sign of 00 is determined 
by the usual rules of arithmetic. An exception is that +2/ +00 is defined to be +0 for any ordinary number 
z. This has advantages and disadvantages. For example, consider the evaluation of f(r) = z/(z? +1). If 
z is so large that x? evaluates to oo then f(z) evaluates to 0 instead of the (representable) approximation 
1/z. This can be turned to advantage by evaluating f(z) = 1/(r + 1/z) instead; not only does the above 
problem go away, but f(0) is correctly evaluated without having to treat it as a special case}. The only 
real disadvantage of ‘infinity arithmetic’ is that poorly constructed algorithms can produce wrong results, 
whereas they would produce error messages or raise exceptions on non-IEEE machines. 


The fact that IEEE arithmetic has two representations of zero is controversial. One justification for it is that 
1/(1/z) evaluates to z for all z. Another is that two signed zeros are sometimes useful to refer to function 
values on either side of a discontinuity; this is particularly useful in complex arithmetic but is not discussed 


In general, reducing the number of tests for special cases improves efficiency on pipelined machines or when 
optimizing compilers are used. 


21 


— 


++ 


here. A simple real arithmetic example is that log(+0) can be defined as —oo and log(—0) as a NaN to 
distinguish between underflowed cases. 


The IEEE Standard compromises by requiring that +0 = —0O in all tests. In particular the test 
‘if (cx = 0) then ...’ does not take the sign of z into account. However the sign of 0 is preserved in 
operations to the extent that, say, (—3) * (—0) evaluates to +0 and (+0)/(—3) evaluates to —0. 


The IEEE Standard includes denormal numbers mainly to ensure that the property 
z=y ifandonlyif r—y=0 


holdst for all finite z. For this to be true, denormal numbers are required to represent z — y when its value 
would otherwise underflow to +0. Also program code such as 


if (c#y) then z=1/(z—y) 


could otherwise fail due to division by zero. This self-consistency is an elegant feature of IEEE arithmetic. 


The IEEE Standard includes the concept of status flags which are set when various exceptions occur. 
Implementations are required to provide a means to read and write these from programs. Once set, a 
status flag will remain so until explicitly cleared. One use of these is to distinguish between an ordinary 
overflow and a calculation that genuinely yields oo, e.g. 1/0. 


The IEEE Standard allows for trap handlers that handle exceptions, and read and write status flags. However 
trap handlers, which are ‘procedures’, are merely recommended and not required by the Standard. 


There are five status flags corresponding to the exceptions: overflow, underflow, division by zero, invalid 
operation, and inexact. The first three are straightforward. An invalid operation is any operation that gives 
rise to a NaN when none of its operands is a NaN. An ineract exception arises simply when an operation 
cannot be evaluated exactly in floating-point, which is clearly a very common occurrence. 


It is recommended that the status flags be implemented by software for efficiency. For example, the inezact 
exception will occur very frequently so it is more efficient for software to disable hardware testing for this 
once it has occurred. If the appropriate status flag is reset then testing for the tneract exception can be 
re-enabled. 


There are various useful applications of trap handlers. For example, for computing 


n 
I]: 
i=l 


where the product may overflow or underflow at an intermediate stage, even if the result is within the 
representable range. The evaluation of this product can be implemented by means of overflow and underflow 
trap handlers. The overflow trap handler works by maintaining a count of overflows and wrapping the 
exponent of a product so that it remains within the representable range. The underflow handler does the 
converse. At the end of the calculation the result is within the representable range if the number of overflows 
is equal to the number of underflows; in this case the wrapping algorithm ensures that the final exponent is 
correct. The cost is almost nothing if no intermediate product overflows or underflows. 


In order that this sort of algorithm can be easily implemented, IEEE 754 specifies that the overflow and 
underflow handlers must be passed the offending value, as an argument, in ‘wrapped-around’ form. 


Goldberg notes that properties such as this, that are useful to make programs provable, tend to lead to 
better programming practice, even though proving large programs correct may be impractical. 

Wrapping, in this sense, means using modular arithmetic to ensure that any exponent is representable. If 
counts are kept of overflows and underflows then the true exponent can be recovered. 
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By default, IEEE arithmetic rounds to the nearest number using ‘round to even’. Three other alternatives 
are provided: round towards 0, round towards —oo, and round towards +oo. A combination of the latter 
two enables ‘interval arithmetic’ to be performed. 


The rationale for the ‘extended’ precisions is that a floating-point algorithm often requires some extra 
precision for certain operations; however modern computer instruction sets tend not to provide this facility. 
The multiplication of two numbers to produce a result of greater precision is an operation that is particularly 
useful. Specifically, for calculating (1) b? — 4ac in the quadratic formula, (2) inner products, and (3) the 
correction to an approximation in certain iterative algorithms. 


Exercise 1f 


What are the results of the following operations? 
1/00 (-0)*00 co#NaN 
(Give signed results where appropriate.) 
Exercise 1g 


Suppose the IEEE Standard were changed to permit a binary implementation with only 5 bits to 
represent each number: a sign bit, 2 bits for the exponent, and 2 bits for the precision. Assuming 
all other features of the Standard are unchanged, including the use of a hidden bit, enumerate all 32 
possible bit patterns and what they would represent. 


2. Elementary numerical methods 
2.1 Numerical differentiation 


Numerical differentiation is an ill conditioned problem. Nevertheless, it is important because many numerical 
techniques require approximations to derivatives. It is a good point to start because it introduces in a simple 
way the ideas of discretization errort (error due to approximating a continuous function by a discrete 
approximation) and rounding error (error due to floating-point representation). 


2.1.1 Finite differences 


The usual definition of the derivative f'(x) is 


= fn fle + 1) Fie) 


h—0 


f(z) 
so an obvious approximation to f’(z) is achieved by choosing a small quantity A and evaluating 


_ f(e+h)~ fe), 


F(z) (2.1) 


We say that f’ (x) is a finite difference approximation to f’(z). The only problem here is: how small should 
h be? This turns out to be critical. 


Supposing that f(z) can be differentiated at least three times, Taylor’s theorem tells us that 


h? 
fla +h) = f(z) + hf'(z) + Fp -F"(z) + O(F"). 
} Discretization error is often called truncation error, but this term is somewhat confusing. 
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This can be rearranged to give 


f(zt+h)— fle) _h 


f@j= 5 f"(2) + O(n?) (2.2) 


so that, comparing (2.1) and (2.2), the discretization error is approximately &| f"(z)| in absolute value. 


Writing [f(z)]* for the floating-point representation of f(x), the calculation involves the evaluation of 


[f(z +h)]" = f(z+h) + ectn 
[f(z)]" = f(z) + €e 


from which, using (2.1), we get 


where A is very approximately the absolute error in the representation of f(z), i.e. macheps.|f(z)|. The 
rounding error is therefore approximately macheps.|f(z)|/h. 


Note that, as h decreases, the discretization error decreases but the rounding error increases. To find the 
value of h that minimises the total absolute error, we differentiate the right-hand side of the equation 


total absolute error = Aip"(2)I + macheps1f(=)| 


with respect to h, and solve 


se"2)l - machepelf(e)| Ni: (2.3) 


It turns out that this is achieved precisely at the value of h for which 
|discretization error| = |rounding error|. However, equation (2.3) involves quantities that are not known. 
For example, we cannot know f’(z) since we are trying to calculate f’(z) in the first place. We are forced 
to make an approximation, so we assume that f(z) and f”(z) are numbers of order 1 (we sometime write 


O(1)) so we can ignore them in the equation. Then, we might as well also ignore constants, so that equation 
(2.3) becomes 


h? ~ macheps 
or 
hz Vmacheps. 


This is justifiable because it is the order of magnitude of h that is critical, not its exact value. It follows 
that the total absolute error in the computed derivative is O(./macheps). However, as we assumed that 
f(z) = O(1), it would be more realistic to use h = \/macheps.|f(z)| in practice in order that the relative 
error is O(,/macheps), under appropriate assumptions. 


Exercise 2a 


List the assumptions made in the analysis in Section 2.1.1. Construct an example for which at least 
one of these assumptions is unreasonable, and demonstrate why. Why are these assumptions made? 
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2.1.2 Second derivatives 


It is, of course, possible to approximate f(z) by taking a finite difference of values of f’(z) but, again. care 
needs to be taken in choice of the value of h. 


Suppose, for simplicity, that f(z) = O(1) and we use h = \/macheps to compute values of fi(z). The 
relative error in f’(z) is approximately h. As an example, take macheps = 10-16. We would use h = 1078 
and expect a relative error of 10~® in f’(z). 


In calculating f(z) we must now regard 1078 as if it were the rounding error, so we should use h = 1074 
as the optimum value of h, giving an approximate relative error of 1074 in f’(z). 


Fortunately, few numerical methods require more than two derivatives. Note that, if f’(z) is known to 
full floating-point precision, then f”(z) can be calculated with a relative error of about 1078; this is of 
importance in designing software interfaces for certain numerical problems. 


2.1.3 Differentiation of inexact data 


Because numerical differentiation is ill-conditioned, the problem is made much worse if the function f(z) is 
known only at a finite number of points or if the function values are known only approximately. A little 
consideration of the above analysis shows that there is no point in attempting to calculate derivatives from 
the data as it stands. The only sensible course of action is to fit a smooth curve to the data by some means. 
and to calculate derivatives of the fitted curve, as indicated in the diagram. 


It is not possible to make precise statements about the accuracy of the resulting derivatives, but the technique 
is useful in some circumstances. 


A method often used in practice is to perform a least squares fit with a cubic spline. This is one of many 
applications of cubic splines, which are described in the next section. Least squares fitting is discussed in 
Section 2.3.5. 


2.2 Splines 


We begin with the problem of interpolation, that is, to find a curve of a specified form that passes through 
a given set of data points. 


The simplest case is to find a straight line through the pair of points (z1, y1), (z2, y2). The equation of the 
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straight line is 


y=a,;r+ a9 


and the problem consists of solving 2 equations (one for each data point) in the 2 unknowns ag, a ;. We could 
say that the problem has 2 degrees of freedom. 


A related problem is to find a straight line that passes through one data point (1,41), with specified slope 
m, say. This problem still has 2 degrees of freedom because there are two items, the data point and the 
slope, that can be varied independently and there are still 2 equations in 2 unknowns. 


In fitting a quadratic curve there are 3 degrees of freedom. For a cubic curve 


y= a3z° + aor" +a;2+ a9 (2.4) 


there are 4 degrees of freedom, etc. With 20 points we could, in principle, fit a polynomial of degree 19 but 
the results might not be as we would wish, e.g. 


Low order polynomials do not have this disadvantage. Suppose we decide to interpolate data by fitting two 
cubic polynomials, p;(z) and p2(z), to different parts of the data separated at the point z =t,. We have 8 
degrees of freedom and so can fit 8 data points, e.g. 


Py (x) Pz(x) 


ty 


+ 


This is worse than using a high order polynomial. What is required is some degree of continuity, e.g. in the 
curve 


P@) p2(x) 


ty 


if we require that p(t.) = po(tx) then the curve is at least continuous. But this requirement uses up one of 
our equations, so we now have only 7 degrees of freedom. We can place further constraints as follows: 


Pi (te) = pa(te) 
Pi (te) = p2(tr) 


with an extra degree of freedom taken up by each. This gives a curve with smooth continuity but only 5 
degrees of freedom. (If we go a step further and require that p/’(t.) = p$’(t~) then the two cubics become 
identical and it is equivalent to fitting (2.4).) The point ¢, is called a knot point. It is possible to specify m 
such knots and to fit a curve consisting of m+1 separate cubics with second derivative continuity. This has 
m-+4 degrees of freedom. 


A curve made up of cubic polynomials with continuity of the function and first and second derivatives is 
called a cubic spline. Numerical methods often require at least this degree of continuity and a cubic spline 
is, in some sense, the simplest function that satisfies this requirement. 


For the purposes of numerical differentiation of inexact data, fitting a cubic spline and differentiating the 
result is probably the best that can be achieved, bearing in mind that there is an infinite choice of knot 
positions. 


2.3 Simultaneous linear equations 


In this section we consider the solution of simultaneous linear equations of the form 
Ax=b (2.5) 


where A is a given matrix of coefficients, b is a given vector, and the vector x is to be determined. We shall 
consider here only the case where A is square and at least one element of b is non-zero. In this case the 
equations have a unique solution if and only if A is a non-singular matrixt. Mathematically, the solution is 
given by 

x= Ab. 
The first point to note is that there is no need to calculate the inverse A~? explicitly because the vector 
A~*b can be calculated directly. Also, calculating A~? and then multiplying by b leads to unnecessary loss 


of accuracy. The calculation of a matrix inverse, which is discussed briefly below, is usually avoided unless 
the elements of the inverse itself are required for some purpose, e.g. in some statistical analyses. 


Note that if A is singular this means that it has no inverse. This is equivalent to saying that A has zero 
determinant or that A has a zero eigenvalue. 
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The solution of the equations (2.5) is trivial if the matrix A has either lower triangular form 


@21 422 
@31 @32 433 


Qn1 Gn2 Gn3 Ann 
or upper triangular form 
411 412 G13 .-- Gin 
@22 423. --- «Gan 
433. .-- G3n 
GQnn 


where missing coefficients are zero. Note also that if (and only if) any diagonal coefficient aj; is zero then 
the matrix A is singular and there is no unique solution. 


To see that the solution is trivial to obtain, consider the upper triangular form. In this case the last equation 
contains only one unknown, zy, that can therefore be determined by 


bn 
fn = —. 
ann 


Then, as z, is now known, the second last equation contains only the one unknown z,,_; which may therefore 
be determined, and so on. 


This so-called back substitution algorithm may be summarised as 


nr 
bj - jet. Qij2j 
aii 


rzpi= 
fori =n,n—1,...1. (Note that, by convention, the summation is over zero terms when the limits do not 
make sense.) 


For a general set of equations Ax = b the solution x is unchanged if any of the following operations is 
performed: 


(1) Multiplication of an equation by a non-zero constant. 
(2) Addition of one equation to another. 
(3) Interchange of two equations. 


The technique used for solving a general set of equations Ax = b is called Gaussian elimination and consists 
of using a combination of the operations (1), (2) and (3) to convert the equations to a trivial case, by 
convention the upper triangular form. There are a large number of different ways of achieving this. The 
strategy adopted is usually called the pivotal strategy, and we shall see that this is crucially important. 


We consider two examples. 
Example 5 


This example is a well-conditioned problem that illustrates the need for a pivotal strategy. Consider 
the 3 simultaneous equations 


011 zy 1 
101 zZ2 = 2 
11 0 23 4 


which are to be converted to upper triangular form. Note that the first equation cannot form the first 
row of the upper triangle because its first coefficient is zero. Therefore we must first interchange the 
first two equations, thus 


101 Zi 
011 tm }/=[1 
1 10 %3 


The next step is to subtract the (new) first equation from the third to get 


1 0 1 zy 2 
01 1 z2 = 1 
0 1-1 73 2 


The final step is to subtract the second equation from the third and the solution z; = 3, m2 = 


zg = —j results. 


vie 


, 


This is a fairly trivial exercise on a sheet of paper, but it indicates that a computer subroutine to deal with 
the general case is less straightforward because several coefficients may be zero at any step. A successful 
pivotal strategy must specify which of the operations (1), (2) and (3) to perform on which equations at each 
stage of the calculation. 


Example 6 


This example shows that it is not only zero coefficients that cause trouble. Consider the pair of 


equations 

0.0003 1.566 zi\_ (1.569 

0.3454 —2.436 zo} \ 1.018 
which have the eract solution z,; = 10, r2 = 1. However, if we perform the calculation on a computer 
with macheps ~ 0.5 x 1074, and multiply the first equation by 0.3454/0.0003 (~ 1151) and subtract 


from the second, we get the solution z; = 3.333, z2 = 1.001 because of loss of significance. The use 
of a small non-zero number as a pivotal value has led to an incorrect answer. 


In fact, closer examination of the loss of significance shows that it is due to the fact that the coefficient 
0.0003 is small compared with 1.566. 


Therefore a successful pivotal strategy requires some concept of the size of a coefficient relative to the other 
coefficients in the same equation. This relative size is conveniently calculated by dividing each coefficient by 
the |, norm of the corresponding row of the matrix. In other words, we scale each equation by dividing it 
by its largest coefficient (in absolute value). In Example 6, we would divide the first equation by 1.566 and 
the second by 2.436. This provides a simple automatic way of checking whether a coefficient is (relatively) 
small. 


2.3.1 Pivotal strategies 


Suppose an n x n set of linear equations is being solved by Gaussian elimination. At the Ist step there are 
n possible equations and one is chosen as pivot. This eliminates one variable and leaves n — 1 equations at 
the 2nd step. At the start of the kth step there are therefore n — k + 1 remaining equations from which to 
select a pivot. 


In partial pivoting these n — k + 1 equations are scaled, by dividing by the largest coefficient of each, then 
the pivotal equation is chosen as the one with the largest (scaled) coefficient of zx. 


In total pivoting both the pivotal equation and pivotal variable are selected by choosing the largest (unscaled) 
coefficient of any of the remaining variables. Total pivoting may be better than partial pivoting, although it 
is more expensive and so is not often used. 
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Work in the 1940s, using forward error analysis, had suggested that the error in Gaussian elimination grew at 
arate that would make the process unstable for large systems of equations. In the 1950s J.-H. Wilkinson used 
backward error analysis to show that this predicted rate of error growth was merely the worst case, and not 
one that appears to occur in practical problems. Gaussian elimination is generally stable for well-conditioned 
problems. 


Exercise 2b 


List the basic operations of Gaussian elimination, i.e. the three operations on the equations Ax = b 
that leave the solution x unchanged. Illustrate how these operations are combined, using exact 
arithmetic on the matrix 


0 1 38 
2-2 1 
4 1 0O 


Show how the method is modified for floating-point arithmetic, by using partial pivoting on the matrix 


4 1 2 
0 10-4 3 
1 10.25 4 


(In each case, omit the back substitution phase of the calculation.) 
2.3.2 Symmetric positive definite equations 


Many practical numerical problems reduce ultimately to the solution of sets of simultaneous linear equations. 
For difficult problems, e.g. the solution of a partial differential equation in a complicated 2- or 3-dimensional 
region, the resulting linear equations may be huge, say 10000 simultaneous equations. 


In general, such large systems may be arbitrarily ill-conditioned and the resulting matrix may appear to be 
singular when floating point arithmetic is used. Worse still, the relative errors may become much greater 
than 1 so that incorrect solutions will be obtained. Indeed even a 2 x 2 system can lead to inaccurate answers 
if the matrix is ‘nearly’ singular. 


However, an important class of problems is always well-conditioned, and Gaussian elimination gives reliable 
answers for any size of matrix. These are problems in which the matrix is symmetric positive definite, and 
pivoting is not needed. 


A symmetric matrix A is positive definite if it satisfies the inequality 


xT Ax >0 


a=({ 1) 
Feeney) 


= 4r? + 22,22 + 425 
= (x1 + Z2)” + 3(z? + x3) 
>0 


for any non-zero vector x. For example, if 


then 


in all cases. It is not always easy to verify this property, but an important special case is where B is any 
real non-singular matrix and A = B7B, then 


x? Ax = x’ B? Bx = (Bx)’ Bx > 0. 
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Fortunately, good algorithms for symmetric positive definite equations can detect when a matrix is not, in 
fact, positive definite so it is not essential to check this independently before attempting a calculation. 


Many important practical problems give rise to symmetric positive definite equations. 
2.3.3 Choleski factorization 


An instructive way of looking at Gaussian elimination is to consider it in terms of matrix algebra. This leads 
to a useful algorithm for the solution of symmetric positive definite equations. 


When solving Ax = b by Gaussian elimination the matrix A is transformed to an upper triangular matrix 
U by means of operations which include the permutation of rows and columns of the matrix. 


A permutation matriz, call it P, is simply a matrix that permutes another matrix when it is multiplied by 
P; its elements are all either 0 or 1 in such a way that there is exactly one 1 in each row and column. The 
unit matrix I is thus the (trivial) permutation matrix that does not change anything. 


It may be shown that Gaussian elimination is equivalent to factorizing the matrix A such that 
A=PLU 


where L is a lower triangular matrix. The equations Ax = b may then be written PLUx = b which means 
that the resulting upper triangular system may be expressed as 


Ux = L7!P7!b 
where the matrices P and L are straightforward to invert. 


Choleski factorization may be applied to symmetric positive definite matrices and differs from the PLU 
factorization in three ways: 


(1) Because symmetric positive definite equations are well conditioned no permutations are necessary so 
P =I, i.e. P can be omitted. 


(2) By symmetry it is possible to arrange that U = L’ so the factorization can be expressed in the form 
A = LL’. However, the diagonal elements of L involve taking square roots; this can be avoided. 


(3) In order to avoid taking square roots the diagonal elements of L are defined to be 1 (and so do not 
need to be stored) and the factorization used is 


A =LDL’ 
where D is a diagonal matrix. This is the Choleski factorization. 


The algorithm for solving symmetric positive definite equations is then as follows: form the Choleski 
factorization, then solve 
Ly =b 
LTx = D-ly 
2.3.4 Matrix inversion 


Suppose that the two sets of equations Ax = b and Ay = ¢ are to be solved, i.e. two sets of equations with 
the same coefficient matrix. This problem could be written 


A(x y)=(b c) 


which can be solved in one go by Gaussian elimination. Note that both the solution and the right-hand side 
are now matrices rather than vectors. In the same way, one application of Gaussian elimination is sufficient 
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to solve the matrix equation AX = C where X is an unknown matrix and C is a constant matrix of the 
same shape as X. 


In particular, if we take C to be the unit matrix I then the solution X will be A7?}. 


As stated above, it is not usually necessary or desirable to invert a matrix, but sometimes the elements of 
the inverse itself are required and Gaussian elimination is normally used as described. 


2.3.5 Linear least squares 


Problems are often encountered where it is desirable to find an n-dimensional vector, call it x,, that most 
closely satisfies (in some sense) the set of m equations 


AmnXn = bm (2.6) 


where Amp is an m x n matrix. This is an overdetermined system and is usually solved in the least squares 
sense. Define the residual vector 


lm = AmnXn — Dm. 
Then the least squares solution of (2.6) is the vector x, that minimises the sum of squares of the residuals, 
i.e. 


r?r = (Ax — b)7(Ax — b). 


The least squares solution may be shown to be the solution of the equations 
ATAx = ATb 


where A? A is square and symmetric positive definite. These so-called normal equations may be solved by 
Choleski factorization. However, since the mid 1970s, this approach has been replaced by solving directly 
the overdetermined system Ax = b by a method known as singular value decomposition}. 


Singular value decomposition is a stable method that can be used to solve almost any system of linear 
equations, yielding a unique solution. In the case that the matrix is singular, the solution will be in a lower 
dimensional space than the original problem. It should be emphasised that such solutions may or may not 
have physical meaning in a given context. The technique is most useful for non-square matrices, and even 
underdetermined problems. 


2.4 Non-linear equations 


Having dealt with linear equations, we will now look briefly at the solution of simultaneous non-linear 
equations. An example of a system of 2 equations with 2 unknowns is 


: 8 
zy +sinzz—7=0 


(2.7) 


which has a solution z1 = 9/14, z2 = 52/6, which is easily verified. However there are two other solutions 
with z2 values close to 51/2. This is an extra complication: non-linear equations do not, in general, have a 
unique solution. In fact, a set of non-linear equations can have any number of solutions; there may be no 
solutions at all, or there may be infinitely many. 


The details of this method are beyond the scope of this course. 
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Exercise 2c 


Show, graphically or otherwise, that there are exactly three pairs of solutions to the equations (2.T). 


It is generally true in numerical analysis that problems without a unique solution are more difficult. Such 
problems must usually be solved iteratively, by using a starting approximation close (in some sense) to the 
required solution. Two particular difficulties that arise are (a) convergence to the ‘wrong’ solution, and (b) 
failure to converge, sometimes because approximations oscillate between two or more solutions. 


A set of n non-linear equations in n unknowns 2), 22, ...Z2, may be written 


II 
oo 


fil, 23, oo Zn) 
fo(21,22,-.-Zn) 


fn(z1,22,---2n) =0 


which can be abbreviated to f(x) = 0 where x is the vector of unknowns and f is the vector of functions f;. 
2.4.1 Newton-Raphson iteration 
We consider first the equation f(z) = 0 where f(z) is a smooth non-linear function of a scalar r. The 


Newton-Raphson method is an iterative technique that is graphically equivalent to improving the estimated 
solution by moving along the gradient of the function f(z), evaluated at the point z. 


x+h 


ae, 
x x f(x) 
The formula may be derived from the Taylor series for f(z) from which we get 
f(z +h) = f(z) + hf’(z) + O(h?). (2.8) 


If z is an approximate solution and z +h is the exact solution then f(z +h) = 0 and we can estimate the 
value of h from (2.8) to get 


_ _ F(z) 2 
h= F(z) + O(h*). 


Writing z to represent an improved approximation we have 


ano JG) 2.9 
= #— Fay a 
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The error in this approximation is O(h?) so it has a quadratic rate of convergence. In fact this rate of 
convergence is only obtained when the approximation is sufficiently close to the solution. Note that (2.9) 
could be written in either of the alternative forms 


f= 2-[f(2))"*f(z) 
f'(z)(@ - 2) = —f(z). 


As a simple example of the use of (2.9), consider the calculation of 2. This is equivalent to solving the 
equation f(z) = 0 where f(z) = r? — 2. In this case f’(z) = 2z and (2.9) becomes 


irae. zr? —2 
2z 
= z?742 
~ Qe 
_zii1 
=5 + a 
Using z = 1 as a starting approximation, successive values of = (obtained on a BBC micro) are 1.5, 


1.41666667, 1.41421569, 1.41421356. (Squaring the last approximation gave 2 to the accuracy of the floating 
point arithmetic.) 


Exercise 2d 


The Newton-Raphson formula (2.9) was compared with 


_ hf(z) 
f(z +h) — f(z) 


Z=2 


(2.10) 


for solution of the equation z? — 5 = 0. Using a particular value of h, the following numerical results 
were obtained for the first three iterations of each method using the starting value z = 2 in each case: 


iteration method(2.9) method(2.10) 


2.000000000 2.000000000 
2.250000000 2.235988201 
2.236111111 2.236063956 
2.236067978 2.236067775 


wnro 


For the first two iterations method (2.10) was more accurate, but method (2.9) was more accurate on 
the third iteration. Which method would you expect to converge faster, and why? 


Suppose method (2.10) is used on a computer for which macheps is about 10-4. Discuss a suitable 
choice of the parameter A. 


2.4.2 Generalisation of Newton-Raphson 


Newton-Raphson iteration can be generalised for solving n non-linear equations in n unknowns, i.e. for 
solving the vector equation f(x) = 0. The derivative f’(z) generalises to a matrix of partial derivatives, the 
(z,7)th element of which is 

Ofi(x) 


02; 


for i = 1,2,...n and j = 1,2,...n. This matrix of derivatives in called the Jacobian J. As division is 
not defined for matrices, we must use one of the alternative forms of (2.9) to express the Newton-Raphson 
formula, i.e. 

x=x-J-'f 
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or, in order to express this as the solution of a set of linear equations, 
J(x —x) = -f. (2.11) 


Note that the solution x — x of (2.11) is the vector of increments to be added to x to form the new 
approximation. 


The rate of convergence of the generalised Newton iteration can also be shown to be quadratic if the 
approximation is sufficiently accurate but, unfortunately, multi-dimensional problems are more difficult and it 
is much harder to ensure convergence. Algorithms for non-linear equations usually involve Newton-Raphson 
iteration, often as a final step, but more strategy and care is needed than in the scalar problem. 


Nevertheless, to illustrate that the formula (2.11) is of some use, we solve the pair of equations (2.7). Here 


the Jacobian is 
1 cosz2 
zr zy , 
The starting approximations 2; = 1, rz = 3 yield successive approximations 


Z = 0.672016558 z2 = 2.66694639 
2, = 0.643149516 22 = 2.61895767 
2; = 0.642857175 22 = 2.61799418 
2, = 0.642857143 22 = 2.61799388 


where the last pair of values is equal to 9/14, 57/6 to the accuracy of the arithmetic. The starting values 
Zi = 1, zo = 7 lead to convergence to z; = 0.226117142, zo = 7.4430273, and starting values z,; = 1, 
r2 = 8 lead to convergence to 2; = 0.205031727, z2 = 8.20846651 corresponding to the other two solutions. 
However, it is all too easy to find starting values which do not lead to convergence. For example, taking 
a starting value in between the latter two solutions yields particularly misleading results. If z; = 0.2, 
£2 = 5/2 then successive convergents ‘settle down’ in the region of z; = 0.008, z2 = 20 although only 
the crudest of error tests would be deceived into falsely detecting convergence. There is no solution of the 
equations near these values, although a little graphical consideration will show why the algorithm tries to 
find one. 


Exercise 2e 


Investigate this problem, both graphically and by writing a short program, using any language with 
a floating-point data type, to find starting values that lead to convergence and otherwise. 


It should be noted that it is often cheaper (in terms of programmer time or computation time) to use 
numerical derivatives to estimate the Jacobian matrix J in (2.11) but the resulting loss of accuracy may 
itself cause problems. 


2.4.3 Il-conditioned problems 


We now return to the scalar non-linear equation f(z) = 0 to investigate cases in which the problem is 
ill-conditioned; these are often cases in which Newton-Raphson iteration becomes unstable. One such case 
is where f’(z) = 0 at the solution of f(z) = 0 such that a Newton-Raphson step performed at the solution 
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would result in division by zero. A typical example is shown below: 


f(x) 


Note that if the z-axis were shifted downwards there would be no solution, and if shifted upwards there 
would be two solutions. An ingenious way of overcoming the difficulty is to solve a sequence of (different) 
problems which successively get closer to the original problem. This sequence of problems is chosen such 
that each is better conditioned than the original problem. Techniques of the type are known as continuation 
methods. 


In order to solve a difficult case of the problem f(z) = 0, with a starting approximation zo, we first solve 


f(z) = Af (zo) (2.12) 


for some parameter 0 < A < 1. Note that the original problem corresponds to \ = 0 and that the problem 
A = 1 is guaranteed to have a solution (x = zo). By taking a succession of decreasing values of A and using a 
suitable robust algorithm (not Newton-Raphson) the solution is obtained to a sequence of problems that get 
nearer to the original problern. Usually, as A approaches zero, these solutions tend to the required solution. 


Note that an algorithm may be described as robust if it can be guaranteed to be successful for any valid 
data. This is often achieved at the expense of time. 


2.5 Quadrature 
2.5.1 Quadrature rules 


Quadrature is the name given to the numerical evaluation of an integral of the form 


I= a f(z)dz 


where either (or both) of a or b may possibly be infinite. An approximation of the form 


b n 
7 f(z)dz =~ >— wif (zi) 


#=0 


where the points x; are chosen such that a < ro < 21 <...,2n < }, is called a quadrature rule. The points 
x; are called the abscissae and the numbers w; are called weights. 


Quadrature rules are usually derived by the integration of an interpolating polynomial. Normally, to avoid 
loss of significance, only rules with positive weights are used. 


We define the error in a quadrature rule by 
én = 1—) wif (zi) 
i=0 
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and it is possible to obtain an expression for this from the corresponding interpolating polynomial ¢. 


We will now examine a few simple quadrature rules to see how their error terms compare. We consider the 
case where a and b are both finite. The simplest reasonable quadrature rule is obtained by taking the value 
of f(z) at the mid-point of the interval (a,6) and treating the function as if it were a constant: 


a atb b 
2 


Not surprisingly, this is called the mid-point rule. The formula, including the error term, is 


a+b, f"(é)(b—a)3 (2.13) 


T= (b-a)f(F2*) 4 


for some & where a < & < b. Alternatively we can fit a straight line between the function values at the 
end-points of the interval: 


This is the trapezium rule with the formula 
b-—a "(n)(b — a)? 
T= ——I[f(e) + f(6)] - Loess (2.14) 
2 12 
for some 7 where a < 7 < b. A spectacular improvement is obtained by a fitting a quadratic curve through 


t See Conte & de Boor. 
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the mid-point and the end-points: 


a a+b b 
This leads to Simpson’s rule which is 


r= 2S Sty a) + 46(222) + Fy - 


f° (C)(F52)§ 
90 


(2.15) 
for some ¢ wherea<¢ <b. 


This idea can be extended by, say, interpolating four equally-spaced points with a cubic polynomial. 
Quadrature rules derived from equally-spaced points are called Newton-Cotes formulae. High order Newton- 
Cotes rules are rarely used for two reasons: (i) in high order rules some weights are negative, which leads to 


instability, and (ii) methods based on high order polynomials usually have the same disadvantages as high 
order polynomial interpolation. 


A more promising approach than increasing the order of the interpolating polynomial is to use a composite 
rule. A composite rule is formed by splitting an integral into a set of panels and applying (usually) the same 


quadrature rule in each panel and summing the results. For example the composite mid-point and composite 
trapezium rules: 


a b 


Let there be n panels. Writing A for the width of each panel, these lead to the similar formulae 


Teh slot {i- 534) 


t=1 
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for the composite mid-point rule, and 


n 
i“ 
I~h}> f(a+ih) 
#=0 
for the composite trapezium rule, where the double prime on the summation indicates that the first and last 


terms are each halved. The fact that these formulae are similar helps to explain why the error terms for 
these rules are also similar. 


Replacing b ~ a by h in (2.18) and (2.14) we see that these rules each have order of convergence h® in any 


one panel. But there are n panels, and n = O(h—'), so the order of convergence of the composite rule is 
h-'h3 = h?. 


The composite Simpson rule has the formula 


I~ a f(a )+ 16) +290 Ho + ih) 443 Flas (FFA 


t=1 i=1 
which, from (2.15), is of order h—!.h5 = h4. 


In general, a single quadrature rule derived from an interpolation formula based on n abscissae is guaranteed 
to have order at least h”+}, although formulae based on specially advantageous choices of the abscissae have 
higher orders of convergence. Note that both the mid-point rule and Simpson’s rule have one higher order 
of convergence than expected; the trapezium rule has the minimum order. The maximum possible order of 
convergence for a polynomial rule based on n abscissae is h?"+!; such a rule is called a Gauss rulet and uses 
optimally chosen abscissae. These abscissae turn out to be non-equally spaced points which, incidentally, 
never include the end-points of the interval; for odd values of n the mid-point is always one of the Gauss 
abscissae. 


Because an arbitrary function f(z) is potentially expensive to calculate, the efficiency of a quadrature scheme 
is usually measured by the number of function evaluations required to estimate the integral to a specified 
accuracy. In a composite rule it is therefore an advantage for the end-points to be abscissae, because a 
function evaluation at an end-point of one panel is used again in the next panel. This also explains why 
there is so little difference between the composite mid-point and composite trapezium rules in practice. 


An interesting compromise between efficiency and high order is achieved by a Lobatto rule. This is an 
n-point quadrature rule in which the end-points are included and the remaining n — 2 abscissae are chosen 
to achieve the maximum possible order of convergence (which then turns out to be h?"~!). The importance 
of Simpson’s rule is explained by the fact that it is the simplest Lobatto rule. 


2.5.2 Summation of series 


This section gives a simple example of the use of numerical analysis to devise and tune an algorithm for 
solving a problem that is easy to pose, but difficult to solve by straightforward means. The method uses 
the mid-point formula (2.13) in order to sum a series by evaluating an integral: this may be thought of as 
quadrature in reverse}. In order to simplify (2.13) a little, first consider the integral of some function f(z) 
over an interval of unit length. Writing 


n+ 
Tn ={, * pa)da 
we see that (2.13) simplifies to 


In = H(n) + 0) 


} The derivation of Gauss rules is discussed in Conte & de Boor. 
t The method described is also related to the so-called Euler-Maclaurin summation formulae. 
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where 6, is some (unknown) value in the interval (n — $,n+ $). Let en be the error term, such that 


ith ) = Loe) mn) 


In the case where f(z) happens to be a positive decreasing function in the interval (n — $, n+ 3), we can 
certainly say that 


f"(n— 3) 
<—_.. 
len] $B 
Using the notation 
q 
Spq = SY f(r), (2.16) 
n=p 


we now consider the problem of evaluating the infinite series S,,.. where, for sufficiently large values of z, 
f(z) is a positive decreasing function of z and a formula can be found for f f(z)dz. Of particular interest 
are functions for which 5S ,.. is not easy to evaluate by simple summation alone. 


We start by devising an algorithm applicable to all suitable functions f(z), then use a particular example 
to see how the algorithm may be tuned for efficiency. Note that a sum of the form (2.16) is equivalent to 
application of the composite mid-point rule to an integral. Suppose that we start off the calculation of 51... 
by summing WN terms then use the composite mid-point rule to approximate the remainder of the sum. We 
have 

S1,c0 = 51,N + SN41,00 


= Sin + ye Un = €n) 


n=N+1 
foo] 1 oO 
= Si, +f d ee. a On 
a eo °~ 34 a ve 
fos} 
Si,00 = S1,N +f f(z)dz + En (2.17) 
N+i 


where, if f(z) is a positive decreasing function for z > N + 4, the error estimate is such that 


1 oo 


At 1 
ck O re-P 


(2.18) 


Note that, given a target absolute error ¢, if the integral in (2.17) can be evaluated then it is possible to 
determine a value of N that should achieve this prescribed error given sufficient floating-point precision. 


As a specific example of a well-known function that is hard to evaluate directly, consider the Riemann zeta 


function: 
oO 
ct } n-* 
n=1 


for any k > 1. Writing f(n) = n-*, the integral in (2.17) is 


N+ 1-—k N+4 k-1 
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and note that f(z) is positive and decreases with z. From (2.18) we get that |En| ~ NED and we 
estimate the value of N necessary to achieve the target absolute error € as 


N= [se] ae (2.19) 


The formula to be used is then 


(N+ 2) 


N 
(hay + 


n=1 


+ En (2.20) 


where |En| ~ €. 


As a numerical example, suppose k = 2 and the target absolute error « = 10-!®. With straightforward 
summation, we can use the integral f, et fe f(z)dz to estimate how many terms are needed: this works out at 


about 101°. Using (2.19) and (2.20) we can estimate the value 


2 


+ ios 
s0-7] = 


N= [ 
for our algorithm. When k = 2, if a computer takes, say, about 0.1 second for this calculation then the 
straightforward sum would take over 300 years. 


Finally, it is worth noting that the integral in (2.17) can itself be estimated by quadrature, which allows the 
method to be used even when a formula for f f(z)dz is not available. However, the quadrature rule will add 
to the computational cost of the algorithm and will complicate the error formula. 


Exercise 2f 


Implement the algorithm described in Section 2.5.2, in any language with a floating-point data type. 
Use the following known values to test its accuracy: 


(Q)=% 
¢(4) = % 


(If your floating-point implementation has poor numerical properties then you may need to compute 
the sum carefully to achieve the expected accuracy — recall Exercise la.) 


If a program timer is available compute ¢(2), say 10 times, and hence estimate how long it would take 
to compute the value once, to similar accuracy, using a straightforward sum. 


3. Numerical software 
3.1 State of the art 


For mainly historical reasons, most software for numerical methods is found in subroutine libraries designed 
to be used from conventional programming languages. This is in contrast to, say, statistical software which 
is often provided in the form of a free-standing package. There are a number of special-purpose libraries for 
single application areas, such as linear equations or optimization. Some research organisations have produced 
general-purpose libraries but these are often restricted to a particular range of computers, or have gaps in 
areas that are not of interest at the originating site. There are only two portable general purpose libraries 
currently in wide use on a world scale. These are the libraries of NAG (Numerical Algorithms Group), based 
in Oxford, and Precision Visuals (formerly IMSL), based in the United States. 
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3.2 Languages 


It is not possible to proceed very far with this subject before discussing the vexed question of the languages 
in which numerical software has been implemented. The significant languages used up to now have been: 
Algol (both Algol 60 and Algol 68), Fortran (all dialects), Pascal and Ada. In Autumn 1990 NAG released 
a small library written in C. By far the most commonly used language for numerical software is Fortran, 
for largely historical reasons. Whereas Fortran has few redeeming features as a programming language, it 
does have standardisation and portability in its favour. Also, in view of the substantial body of well-tested 
software that already exists in Fortran, it can be expected to maintain its pre-eminent position for a few 
years at least. A new Standard, Fortran 90, was standardised in 1991 and NAG has produced the first 
compiler, though not as yet accompanied by a numerical library. 


In the 1980s, it was predicted that Algol 68, Pascal or Ada would become the future language for numerical 
software, but none of these forecasts has come true. Algol 68 has virtually died out, and never was much used 
in the United States. The difficulty with standard Pascal is that 2-dimensional arrays, crucial to the handling 
of matrices in numerical subroutines, are not implemented in a suitable way. There was heavy investment in 
Ada by the N.A.T.O. countries for military purposes, but the end of the ‘Cold War’ has resulted in a rapid 
decline in interest in this language for largely non-technical reasons. 


It is to be hoped that good numerical subroutines will become widely available from good programming 
languages in the future, although this is unfortunately not the case at present. The future trend may be 
to improve facilities for mixed language programming, or possibly to move towards packaged numerical 
software. 


3.3 Floating-point arithmetic: software considerations 
3.3.1 The Brown model 


We have already seen that one implementation-specific constant, macheps, is important in numerical 
algorithms. In order to circumvent overflow or underflow in an algorithm it is also necessary to know 
the limits of the representable number range. Other constants may be significant in particular algorithms. 


It is still the case that IEEE implementations are the exception rather than the rule. The task of 
writing implementation-independent subroutines is considerably simplified by the use of a ‘lowest common 
denominator’ model of floating-point arithmetic. The Brown model}, or some modification of it, is one that 
is commonly used. The model assumes that a floating-point representation can be described by only four 
parameters: 


B - the base of the arithmetic 

N - the number of digits (of base B) in the mantissa 
EMIN - the minimum exponent 
EMAX - the maximum exponent 


then any floating-point number can be represented in the form 

+(.d,d2...dy)B® 
where d,d2...dy are digits (of base B) and E is the exponent in the range EMIN < E < EMAX. A library 
need only contain four implementation-dependent functions that supply the values of these parameters for 


the use of portable algorithms. In principle, other implementation-dependent constants can be derived from 
the parameters of the Brown model. 


+ Due to W.S. Brown, 1981. 
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3.3.2 Implementation issues for IEEE arithmetic 


IEEE arithmetic is more closely specified than Brown’s model, which was used to define floating-point 
arithmetic in the language Ada. For example under IEEE it is possible to prove that (3.0/10.0) * 10.0 
evaluates to 3.0; this is not possible under the axioms of Brown’s model. 


Under IEEE 754, because it specifies the number of bits to be used, the use of exactly rounded operations 
means that the results of calculations should be identical when a program is moved from one processor to 
another, provided they both support the Standard. Furthermore it is possible, at least in principle, to prove 
that some floating-point algorithms produce answers correct to within prescribed tolerancesj. 


Another difficulty that is encountered in programming languages, e.g. some of the early implementations of 
C (ie. Kernighan & Ritchie C), is that the programmer’s use of parentheses is not honoured on the grounds 
of efficiency! In floating-point, (x + y) + z is not the same as z+ (y+ 2) and the programmer’s parentheses 
may be inserted specifically to avoid wrong answers. This was corrected when C was Standardised by ANSI: 
nobody cares how efficiently the wrong answer can be calculated. Many floating-point algorithms fail if 
optimizing compilers assume ‘true real arithmetic’. If optimizing compilers get cleverer then worse problems 
may occur — see Section 3.2.3 of Goldberg’s paper for an example. 


Care must be taken when mixing precisions because some of the ‘invariance rules’ described above do not 
apply if precisions are mixed. For example, the number 3.0/7.0 is a recurring fraction in base 2 or 10, so 
it will have different representations in single and double precision. If a compiler evaluates everything in 
double precision, and z is in single precision, then z = 3.0/7.0 may assign the correct value to z, but the 
test ‘if (zc = 3.0/7.0) then ...’ may not work as expected. 


There may be conflicts between IEEE and language Standards. For example, one ANSI language Standard 
states that ‘any arithmetic operation whose result is not mathematically defined is prohibited’. This raises 
problems because too can be used as an ordinary value and may give rise to anomalous results, especially 
if the operation z/oo gets rid of the infinite value. 


Under default rounding, z — z is defined to evaluate to +0 for all finite z. Thus (+0) — (+0) = +0. Also 
—(+0) is defined to be ~0, so —z should not be implemented as 0 — z as this would be wrong for z = +0. 


The use of NaNs violates many of the desirable ‘invariance rules’. The test ‘if (cr = z) then ...’ yields true 
unless z is a NaN in which case it is always false by definition. Also, the test ‘if (x < y) then ...’ is not 
always the same as ‘if not(r > y) then ...’ because NaNs are unordered by definition. 


Exception handling leads to problems with synchronicity on pipelined machines and with optimizing 
compilers. Hardware and software solutions to these problems exist but, unfortunately, these are not generally 
portable to other systems. 


3.4 Problem formulation and user interfaces 


In the design of a subroutine for general use, say as a part of a numerical software library, there are a 
number of considerations besides the choice of the underlying algorithm. We will now examine some of these 
considerations by means of examples. 


N.B. Proving the correctness of a floating-point algorithm does not imply that it solves the corresponding 
mathematical problem correctly. Numerical analysis is still necessary to understand the properties and 
limitations of the method on which the algorithm is based. For example, it would be very useful to prove 
the correctness of an adaptive integration procedure (see Section 3.4.1) but, for any such procedure, there 
must exist functions for which it gives arbitrarily inaccurate answers. There is no contradiction here: the 
numerical method is a finite approximation to an infinite process, so cannot in principle be made to work 
in all cases. Proving the algorithm correct amounts to showing that the (intrinsically flawed) numerical 
method has been implemented correctly. Such a proof would be at least reassuring. Nevertheless adaptive 
integration methods are useful in practice except only in pathological cases. 
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3.4.1 Automatic quadrature 


Suppose that 


integrate(f,a,b, etat, result, error) 


is a subroutine that returns an estimate of J = ic f(z)dz in result, given a target accuracy etat to be used 
in a mixed error test. The variable error is set non-zero if any kind of error condition is detected (such as 
detecting that the integral J does not exist). 


There is a significant difference between a subroutine that evaluates, say, a certain composite quadrature 
tule and a subroutine such as integrate that performs all necessary calculations and returns an estimate 
of the integral to a prescribed accuracy. The black bor approach of the subroutine integrate is known as 
automatic quadrature. The algorithm for such a routine not only needs a suitable quadrature rule or rules, 
but also a strategy for achieving the requested accuracy and a system of bookkeeping to keep track of the 
progress of the calculation. In particular, if function evaluations are to be minimised, it is necessary to ensure 
that the function f(z) is not evaluated at the same point more than once, and also that function values are 
stored if there is a possibility that they may be required more than once. Needless to say, the work done in 
bookkeeping should be negligible compared with the function evaluations if the algorithm is to be efficient. 


Simpson’s rule, or the composite form of it, can be used as the basis of more or less sophisticated automatic 
quadrature schemes. Two such schemes will be described for the integral [ . f(z)dz. 


Simple automatic Simpson quadrature 


A simple strategy is to use the composite Simpson rule with different values of h in such a way that function 
evaluations can be re-used as much as possible. This can be done by evaluating the rule with an initial 
value of h then re-evaluating using h/2, h/4, h/8, ... until the process converges to the required accuracy. 
Consider dividing the integral into first 1 and then 2 panels: this is the simplest case but illustrates the 
principle involved. 


a at yh at})h a+3/,h b 


Writing h = b — a, the first application of Simpson’s rule uses f(a), f(a + $h) and f(b). The second 
application re-uses these 3 values and also needs f(a + $h) and f(a + 3h). The relevant formulae are 


S. = FIs (a) + 4F(a+ 5A) + 0) 
5 = tla atlas Lh) Hofled 2a) Gap led 22) +70) 
oes U 4 2 4 
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with errors given by 


tu AYS 
r-s,--E0@" 
iv AYS 
1-5 = 2A 


If we make the assumption that f*”(¢,) ~ f'”(¢2) then, by subtracting these error estimates and re-arranging, 
we get 
ied (591) 


90. 720052 — St) 


from which we get the estimate 


So—S. 
I-S,= 72. (3.1) 
15 


In other words, the difference between the two approximations is about 15 times as big as the absolute error 
in S2. Since all the function evaluations used in S; are re-used in S2, the extra work needed to provide this 
useful error estimate is negligible. 


It is clear that a successful and reasonably efficient automatic quadrature routine can be designed using 


successive halving of the interval h. However, consider how the above scheme would perform when integrating 
the following function: 


be b 


In order to achieve greater efficiency, particularly for difficult integrals, it is better to allow the strategy to 
depend on the function f(z) itself. Such an automatic scheme is said to be adaptive. 


Adaptive Simpson quadrature 


In this scheme the interval (a, b) is first split up into several equal subintervals. Only the two rules S, and So 
are used but, if the estimated contribution to the error in each subinterval is too large then that subinterval 
is further sub-divided and the process repeated. The net effect is that the final sub-intervals are not all of 
equal size, and these sizes depend on the behaviour of the integrand f(z) in that region. 


The bookkeeping has to take account of all the nested subintervals and to ensure that the whole of (a, 6) is 
covered. There is obviously a lot more work involved in ensuring that function evaluations are re-used, but 
adaptive schemes are generally more efficient for difficult problems. 


Exercise 3a 


Describe how the adaptive Simpson method could be implemented using a recursive function (used 
internally by integrate). How would you ensure that the algorithm terminates? Describe circumstances 
in which you would expect an adaptive scheme to be (a) more efficient, and (b) less efficient than a 
non-adaptive scheme. 
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3.4.2 Ordinary differential equations 


Suppose a particular algorithm can be applied to a wide class of problems. If a subroutine is to use this 
algorithm then it is highly desirable that the user interface should not prevent the solution of problems for 
which the algorithm is applicable. 


This ideal is not always easily achieved, but consider the first order differential equation 
y =f(z,y); y(a) = 


where y’ = dy/dz and a and 6 are given constants. As the numerical solution of the equation begins at 
the point z = a, this is called an initial value problem. There are several algorithms in current use for 
the solution of such equations. Using the notation z, = a+nh, yn ~ y(tn), one of the simplest is the 
Runge-Kutta scheme 


ki =hf(2n, Yn) 
ko = hf(tn thynt ky) 


1 
Ynt1 = Yn + 5 (ki + ke) 
which can be used to advance the solution by fixed steps of length h in the z direction}. 


This algorithm, and many other algorithms for this problem, can be generalised to the solution of a set of 
m simultaneous differential equations of the form 


y'=f(z,y); y(a)=b (3.2) 


where a is a value of the scalar z, b is the initial value of the vector of solutions y, and f is a vector of 
functions. This is called a first-order system of differential equations. The generalisation of the Runge-Kutta 


scheme is 
ki = hf (Zn, Yn) 


ke = Af (tn +h, yn + ki) 
1 
Yn+1 = Yo t+ 5 (ka +k) 


where k,, kz are also now vectors. Because of the similarity of the scalar and vector formulae it is usual for 
subroutines for initial-value problems to be written for general systems rather than just the scalar case. 


In fact, this algorithm is quite generally applicable. Consider, for example, the solution of the second order 
differential equation 


y+ p(z)y? =4(z); y(0) =u,y(0) =v (3.3) 
where p, q are given functions and u, v are given constants. By writing z1 = y, z2 = y’ we can write (3.3) 
as the pair of equations 
a = 22 
—p(z)z3 + 4(z) 
with initial conditions z,(0) = u, z2(0) = v. Note that this can be written in the vector form (3.2), and is 
how a first-order system. This formulation can easily be used for most initial-value problems. 


2 


3.4.3 Modular use of software. 


Given a reliable source of numerical software for a reasonably wide range of problems, it is often possible 
to solve more complex problems by combining library subroutines in a suitable way. Such ingenuity does 
not remove the need for adequate analysis of the algorithms involved but, often, a means to an end can 


This algorithm is chosen for illustrative purposes only, so the details do not matter here. The derivation of 
Runge-Kutta schemes is dealt with in Conte & de Boor. 
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be justified if the final results can be checked independently. For example, when solving equations, if the 
computed solutions can be shown to satisfy the equations (to appropriate accuracy) then the computed 
solutions can be assumed to be adequate. 


As an example, it is unlikely that a library subroutine would be found specifically for the problem of finding 
a vector x that yields 


min i f (x, t)dt. (3.4) 


However a subroutine like integrate (as described in Section 3.4.1) would usually be available in a general- 
purpose library. Suppose also that the subroutine 


minimise(z,n,g, error) 


is available as a black bor for minimising a function g(x), where x is a vector of length n. The vector x is held 
in the array z. As in integrate, the variable error indicates the success of the algorithm or otherwise. The 
user initialises the array z with a starting approximation and, if the algorithm is successful, the minimising 
vector is returned in the same array. The user also must provide code for evaluating the function g. The 
manner in which this is done depends, to some extent, on the programming language used and could be 
expected to be part of the specification of minimise. This means that the user cannot vary this interface. 
For the sake of simplicity, let us suppose that g is to be evaluated by means of a subroutine 


g(z,n, funval, error) 


where the value of g(x) is returned in the variable funval. The values of the function f are also required 
by integrate and could be supplied by a subroutine with the specification 


f(t, funval, error) 


where ¢ is the variable over which the integration is to be performed. 


The problem (3.4) can be solved by calling minimise and providing a subroutine g which itself calls integrate; 
integrate calls f. The array z, which is not a part of the specification of f, needs to be made globally 
available. Note also that, as the value of g for an arbitrary vector x depends on the success of integrate, it 
is highly desirable that g can return a value of error in case integrate fails. 


Exercise 3b 


Choose any suitable programming language and write brief user documentation on how to implement 
the method of Section 3.4.3, assuming that appropriate library software is available. 


3.5 BLAS: Basic Linear Algebra Subroutines 


Finally, we will look briefly at a method currently used to enhance the portability of numerical library 
software. 


Linear algebra is the name usually given to computations involving matrices, including the solution of 
simultaneous linear equations and eigenvalue problems. Since 1979 attempts have been made to define 
subroutine interfaces} for basic vector and matrix operations (e.g. to add two vectors, or to multiply a row 
of a matrix by a vector) in order that more complex algorithms (e.g. Choleski factorization) can be coded 
in terms of these building blocks. These have been called by the acronym BLAS: Basic Linear Algebra 
Subroutines. There are currently about 100 specified BLAS. 


Due to C. Lawson, R.J. Hanson, D. Kincaid & F. Krogh, 1979; J.J. Dongarra, J.J. du Croz, 


S.J. Hammarling & R.J. Hanson, 1988; J.J. Dongarra, J.J. du Croz, I.S. Duff & S.J. Hammarling, 1990; and 
D.S. Dodson, R.G. Grimes & J.G. Lewis, 1991. 
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The idea has two significant advantages: 
(1) New linear algebra algorithms can be implemented more easily. 


(2) On special architectures, such as vector or array processors, algorithms using BLAS can be speeded 
up by re-coding only the BLAS to take advantage of the architecture. 


There are also two (relatively minor) disadvantages worth mentioning: 


(1) In problems with a small matrix size, e.g. 2 or 3-dimensional graphics, the use of BLAS may mean that 
subroutine calls dominate the computation time. Consequently the use of BLAS may be relatively 
inefficient. (This comment is obsolescent as processors become faster.) 


(2) The detailed specification of BLAS is, to some extent, language-dependent and a de facto standard 
specification needs to be established for each appropriate language. 


In 1990 the idea of BLAS was extended by providing a portable library of higher level numerical linear 
algebra routines, called LAPACK, specifically aimed at high-performance computers. 
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